What are the languages that are shaping the world’s economy? Or in other words: “Do I really need to know more than English and if so, what?” The answer is going to be yes. So then I’ll ask which languages are in the best and worst position for natural language processing. Where do we need to go to work? (Theoretically, this post could also help you decide which languages to learn, but language learning works much better if you have passion and/or need.)
As a first swipe, let’s take a look at languages that are used in countries that have a 2012 GDP of at least $100 billion and have a 2011-2012 growth rate of at least 5%.
- Asia: China, India, Indonesia, Saudi Arabia, Thailand, Qatar, Kazakhstan, Kuwait, Vietnam, Bangladesh, Iraq
- Americas: Venezuela, Chile, Peru
- Africa: Nigeria, Angola, Libya
One of the first things that you should notice is that English is only minimally represented. English is probably most useful in India and Nigeria. This isn’t to say that English isn’t a major force in the world: of course it is. For what it’s worth, there are about 335m English speakers worldwide, about 430m if we include non-native speakers.
So what are the languages that go with the places above?
- Mandarin Chinese: China, Taiwan, Malaysia, Singapore; 848m speakers worldwide; 70% of speakers in China know it as a first language. There are other Chinese languages, of course. Consider powerhouse cities like Chengdu and Chongqing, which have Sichuanese; or Foshan, Guangzhou, Hong Kong and Shenzhen, where Cantonese is big; or Hangzhou and Shanghai with their Wu dialects. There are well over a million Chinese speakers in Thailand, but they mostly speak Min Nan, not Mandarin. Hm. I’m going to need to do a whole post on Chinese, aren’t I?
- Spanish: Venezuela, Chile, Peru; Spain isn’t doing so well but Barcelona and Madrid are still among the richest cities in the world. Worldwide there are about 406m Spanish speakers.
- Hindi: India has 258m speakers. India is also a land of a tremendous amount of linguistic diversity. Telugu, Marathi, Tamil, Urdu, Gujarati, and Kannada each have over 35m speakers.
- Standard Arabic: Saudi Arabia, Qatar, Kuwait, Iraq, Libya. Fwiw, no one really speaks Standard Arabic and each of these countries essentially has its own Arabic dialect. Worldwide there are 206m people who speak some Arabic variety as a first language.
- Portuguese: The official language of Angola, as well as of Brazil and of course Portugal, whose capital Lisbon is one of the richest cities in the world. There are 202m speakers of Portuguese worldwide. Back in Angola, other major languages include Umbundu (6m) and Kimbundu (4m).
- Bengali: 110m speakers in Bangladesh, 82.5m in India.
- Vietnamese: There are 66m in Vietnam.
- Malay-Indonesian: Indonesia and Thailand (and obviously Malaysia, but Malaysia doesn’t have the GDP growth to qualify here). It’s tricky to decide what exactly to count—official forms, which distinguish Standard Malay from Indonesian, or something else. Let’s just call it 40m speakers and know that it’s probably a low-ball estimate.
- Thai: 20.2m speakers in Thailand.
- Nigeria has 522 living languages. English is the national language, but various regions are dominated by Hausa (18.5m speakers in country), Igbo (18m), Yoruba (18.9m); there are also a lot of Nigerian Fulfulde speakers (11.5m); note that Nigerian Pidgin is spoken by 30m people. You really need to listen to it.
- Kazakh: 5.3m in Kazakhstan, 1.3m in China.
- Central Kurdish: 3.5m in Iraq, another 3.25m in Iran, which is also doing fairly well.
Let’s extend our net a little further. We’ll consider languages spoken in cities that are among the richest and the fastest growing (in terms of GDP). We’ll also consider the languages of countries that have at least $30b in 2012 GDP and at least a 3% growth in that number since 2011. Furthermore, we’ll restrict ourselves to languages that, in our areas of interest, have at least 3m speakers. That gives us 107 languages in 71 countries (recall that there are about 7,000 languages in the world today).
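If you want to try this kind of cut yourself, here is a minimal sketch of the country filter. The `Country` class, the `powerhouses` function, and every figure in `sample` are my own placeholders, not the data behind this post, and treating "growth" as the simple year-over-year change in GDP is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Country:
    name: str
    gdp_2011: float  # billions of US dollars
    gdp_2012: float  # billions of US dollars


def growth_rate(country: Country) -> float:
    """Year-over-year GDP change, 2011 to 2012, as a fraction (assumed definition)."""
    return (country.gdp_2012 - country.gdp_2011) / country.gdp_2011


def powerhouses(countries, min_gdp, min_growth):
    """Names of countries meeting both the 2012 GDP floor and the growth floor."""
    return [
        c.name
        for c in countries
        if c.gdp_2012 >= min_gdp and growth_rate(c) >= min_growth
    ]


# Placeholder figures for illustration only, NOT real GDP data.
sample = [
    Country("Country A", 95.0, 105.0),   # big enough and growing fast
    Country("Country B", 200.0, 204.0),  # big, but growing only 2%
    Country("Country C", 40.0, 42.0),    # smaller, growing 5%
]

first_pass = powerhouses(sample, min_gdp=100, min_growth=0.05)
wider_net = powerhouses(sample, min_gdp=30, min_growth=0.03)
print(first_pass)  # ['Country A']
print(wider_net)   # ['Country A', 'Country C']
```

The 3m-speaker cutoff would be a second filter of the same shape, applied to languages rather than countries.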
How NLPable are these languages?
Wikipedia offers a handy proxy for measuring how NLPable a language is: the more pages a language has in Wikipedia, the easier it is likely to be to get started working on the language. Of our 107 Economic Powerhouse languages, 27 have 100,000 or more pages in Wikipedia (as you might guess, European languages dominate here). Another 20 languages have 10,000-100,000 pages. 24 languages have fewer than 10,000 Wikipedia pages. Before I tell you how many have none, I need to exclude the 8 spoken varieties of Arabic in our data set because it’s conventional to write in Standard Arabic (and there are 225,000 pages in Standard Arabic). We might also remove varieties of Thai, Italian, and German (9 total). After doing that, there are still 19 of the 107 Powerhouse Languages without any Wikipedia pages at all (that’s 18%).
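As a sanity check, the counts in that paragraph do add up to 107. A quick sketch of the arithmetic, using only the numbers quoted above (the variable names are mine):

```python
TOTAL_LANGUAGES = 107

# Languages that have at least some Wikipedia pages, by bucket.
with_pages_by_bucket = {
    "100,000 or more pages": 27,
    "10,000-100,000 pages": 20,
    "fewer than 10,000 pages": 24,
}

# Varieties set aside because they are normally written in a standard form.
set_aside_by_group = {
    "spoken varieties of Arabic": 8,
    "varieties of Thai, Italian, and German": 9,
}

with_pages = sum(with_pages_by_bucket.values())
set_aside = sum(set_aside_by_group.values())
no_pages = TOTAL_LANGUAGES - with_pages - set_aside
share = no_pages / TOTAL_LANGUAGES

print(with_pages, set_aside, no_pages)  # 71 17 19
print(f"{share:.0%}")                   # 18%
```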
For reference, the highest Wikipedia pages per speaker ratios can be found for European languages (Swedish, Dutch, Norwegian, Danish, Czech, Polish, Hungarian, French, German, Italian) and some Asian ones (Kazakh, Hebrew, Tagalog, Cebuano, Malay). There’s terrible representation for African languages (Sesotho, Tswana, Tigrinya, Zulu, Igbo, Oromo, Xhosa, Hausa, Fula, Kanuri) and some South Asian languages (Bengali, Oriya, Punjabi, Sindhi).
It’s much easier to do NLP when there are ample resources already digitized (and even better if they are collected and organized). Some resources/references on languages and NLP include the LREC and IJCNLP conferences.
We can also see what kind of research support there is for various languages by going to Google Scholar and searching for the language name plus “nlp” (natural language processing). We can then compare the number of search results to the total number of speakers, to the number of pages on Wikipedia, and to the number of results for “nlp” alone (156,000). Here I’ll restrict myself to just the Powerhouse Languages that have at least 10,000 Wikipedia pages. Looking among these languages, it is obvious that English is in the top position. By a lot. Relatively speaking, German, French, Japanese, Italian, Czech, Greek, Norwegian, Korean, Thai, Danish, and Hebrew are among the best researched. The languages in the worst shape are Vietnamese, Malayalam, Azerbaijani, Kazakh, Tagalog, Belarusian, Gujarati, Uzbek, Kurdish, Yoruba, Cebuano, and Javanese.
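The comparisons described above come down to a few ratios. Here is a rough sketch of how they could be computed. The function name `research_profile` and the example language are hypothetical, and the hit count, speaker count, and page count passed in at the bottom are made-up inputs; only the 156,000 baseline comes from the search mentioned above.

```python
NLP_BASELINE_HITS = 156_000  # Google Scholar results for "nlp" alone


def research_profile(hits: int, speakers_millions: float, wiki_pages: int) -> dict:
    """Normalize a language's "<language> nlp" hit count three ways."""
    return {
        # Fraction of all "nlp" results that mention the language.
        "share_of_nlp_hits": hits / NLP_BASELINE_HITS,
        # Research attention relative to the size of the speaker population.
        "hits_per_million_speakers": hits / speakers_millions,
        # Research attention relative to the text already available online.
        "hits_per_thousand_wiki_pages": hits / (wiki_pages / 1000),
    }


# Hypothetical language with invented numbers, for illustration only.
profile = research_profile(hits=1_560, speakers_millions=50, wiki_pages=20_000)
for measure, value in profile.items():
    print(f"{measure}: {value:.2f}")
```

Sorting languages by any one of these measures would give a "best researched" to "worst shape" ordering; which measure matters most depends on what you are trying to build.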
The main take-away is that if you are doing work with a global perspective and you’re only paying attention to English you are not alone. But you are missing enormous opportunities. Depending upon what you’re trying to do, some of these languages will be more interesting than others. While it’s true that the number of G8 languages is relatively small and gets you pretty far, it’s obvious that most of the world’s communication is happening in a much more diverse set of languages. If you think globally, then it’s probably a pretty safe bet that whether you care about what people are speaking at the office or at home, the languages you need to be thinking about are more numerous (and less well known) than you suspect.
– Tyler Schnoebelen