Archive | May, 2013

Economic powerhouse languages

30 May


What are the languages that are shaping the world’s economy? Or in other words: “Do I really need to know more than English and if so, what?” The answer is going to be yes. So then I’ll ask which languages are in the best and worst position for natural language processing. Where do we need to go to work? (Theoretically, this post could also help you decide about what languages to learn but language learning works much better if you have passion and/or need.)

As a first swipe, let’s take a look at languages that are used in countries that have a 2012 GDP of at least $100 billion and have a 2011-2012 growth rate of at least 5%.

  • Asia: China, India, Indonesia, Saudi Arabia, Thailand, Qatar, Kazakhstan, Kuwait, Vietnam, Bangladesh, Iraq
  • Americas: Venezuela, Chile, Peru
  • Africa: Nigeria, Angola, Libya

One of the first things that you should notice is that English is only minimally represented. English is probably most useful in India and Nigeria. This isn’t to say that English isn’t a major force in the world: of course it is. For what it’s worth, there are about 335m English speakers worldwide, about 430m if we include non-native speakers.

So what are the languages that go with the places above?

  • Mandarin Chinese: China, Taiwan, Malaysia, Singapore; 848m speakers worldwide, 70% of speakers in China know it as a first language. There are other Chinese languages, of course. Consider powerhouse cities like Chengdu and Chongqing, which have Sichuanese; Foshan, Guangzhou, Hong Kong and Shenzhen, where Cantonese is big; or Hangzhou and Shanghai with their Wu dialects. There are well over a million Chinese speakers in Thailand, but they mostly speak Min Nan, not Mandarin. Hm. I’m going to need to do a whole post on Chinese, aren’t I?
  • Spanish: Venezuela, Chile, Peru; Spain isn’t doing so well but Barcelona and Madrid are still among the richest cities in the world. Worldwide there are about 406m Spanish speakers.
  • Hindi: India has 258m speakers. India is also a land of a tremendous amount of linguistic diversity. Telugu, Marathi, Tamil, Urdu, Gujarati, and Kannada each have over 35m speakers.
  • Standard Arabic: Saudi Arabia, Qatar, Kuwait, Iraq, Libya. Fwiw, no one really speaks Standard Arabic and each of these countries essentially has its own Arabic dialect. Worldwide there are 206m people who speak some Arabic variety as a first language.
  • Portuguese: The official language of Angola, as well as of Brazil and of course Portugal, whose capital Lisbon is one of the richest cities in the world. There are 202m speakers of Portuguese worldwide. Back in Angola, other major languages include Umbundu (6m) and Kimbundu (4m).
  • Bengali: 110m speakers in Bangladesh, 82.5m in India.
  • Vietnamese: There are 66m in Vietnam.
  • Malay-Indonesian: Indonesia and Thailand (and obviously Malaysia, but Malaysia doesn’t have the GDP growth to qualify here). It’s tricky to decide what exactly to count—official forms, which distinguish Standard Malay from Indonesian, or something else. Let’s just call it 40m speakers and know that it’s probably a low-ball estimate.
  • Thai: 20.2m speakers in Thailand.
  • Nigeria has 522 living languages. English is the national language, but various regions are dominated by Hausa (18.5m speakers in country), Igbo (18m), Yoruba (18.9m); there are also a lot of Nigerian Fulfulde speakers (11.5m); note that Nigerian Pidgin is spoken by 30m people. You really need to listen to it.
  • Kazakh: 5.3m in Kazakhstan, 1.3m in China.
  • Central Kurdish: 3.5m in Iraq, another 3.25m in Iran, which is also doing fairly well.

Let’s extend our net a little further. We’ll consider languages spoken in cities that are among the richest and the fastest growing (in terms of GDP). We’ll also consider the languages of countries that have at least $30b in 2012 GDP and at least a 3% growth in that number since 2011. Furthermore, we’ll restrict ourselves to languages that, in our areas of interest, have at least 3m speakers. That gives us 107 languages in 71 countries (recall that there are about 7,000 languages in the world today).

How NLPable are these languages?

Wikipedia offers a handy proxy for measuring how NLPable a language is: the more pages a language has in Wikipedia, the easier it is likely to be to get started working on the language. Of our 107 Economic Powerhouse languages, 27 have 100,000 or more pages in Wikipedia (as you might guess, European languages dominate here). Another 20 languages have 10,000-100,000 pages. 24 languages have fewer than 10,000 Wikipedia pages. Before I tell you how many have none, I need to exclude the 8 spoken varieties of Arabic in our data set because it’s conventional to write in Standard Arabic (and there are 225,000 pages in Standard Arabic). We might also remove varieties of Thai, Italian, and German (9 total). After doing that, there are still 19 of the 107 Powerhouse Languages without any Wikipedia pages at all (that’s 18%).
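As a quick sanity check on the arithmetic, here is a minimal Python sketch that adds up the buckets above. The counts are copied straight from this post (nothing is pulled from Wikipedia), and the bucket labels are just names made up for the example.

```python
# Bucket sizes as reported in the post; the labels are only for illustration.
buckets = {
    "100,000+ pages": 27,
    "10,000-100,000 pages": 20,
    "some pages, but under 10,000": 24,
    "spoken Arabic varieties (set aside)": 8,
    "Thai, Italian, German varieties (set aside)": 9,
    "no Wikipedia pages": 19,
}

# The buckets should account for every Powerhouse Language.
total = sum(buckets.values())

# Share of the full set that has no Wikipedia pages at all.
share_without = buckets["no Wikipedia pages"] / total

print(total)                   # 107
print(f"{share_without:.0%}")  # 18%
```

The buckets sum to the full 107, so no language is counted twice or left out.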

For reference, the highest Wikipedia pages per speaker ratios can be found for European languages (Swedish, Dutch, Norwegian, Danish, Czech, Polish, Hungarian, French, German, Italian) and some Asian ones (Kazakh, Hebrew, Tagalog, Cebuano, Malay). There’s terrible representation for African languages (Sesotho, Tswana, Tigrinya, Zulu, Igbo, Oromo, Xhosa, Hausa, Fula, Kanuri) and some South Asian languages (Bengali, Oriya, Punjabi, Sindhi).

Wikipedia pages per speaker (natural log scale)

It’s much easier to do NLP when there are ample resources already digitized (and even better if they are collected and organized). Some resources/references on languages and NLP include LREC and IJCNLP.

We can also see what kind of research support there is for various languages by going to Google Scholar and searching for the language name plus “nlp” (natural language processing). We can then compare search results to total number of speakers, to pages on Wikipedia, and to the number of results for “nlp” alone (156,000). Here I’ll restrict myself to just the Powerhouse Languages that have at least 10,000 Wikipedia pages. Looking among these languages, it is obvious that English is in the top position. By a lot. Relatively speaking, German, French, Japanese, Italian, Czech, Greek, Norwegian, Korean, Thai, Danish, and Hebrew are among the best researched. The languages in the worst shape are Vietnamese, Malayalam, Azerbaijani, Kazakh, Tagalog, Belarusian, Gujarati, Uzbek, Kurdish, Yoruba, Cebuano, and Javanese.

The main take-away is that if you are doing work with a global perspective and you’re only paying attention to English you are not alone. But you are missing enormous opportunities. Depending upon what you’re trying to do, some of these languages will be more interesting than others. While it’s true that the number of G8 languages is relatively small and gets you pretty far, it’s obvious that most of the world’s communication is happening in a much more diverse set of languages. If you think globally, then it’s probably a pretty safe bet that whether you care about what people are speaking at the office or at home, the languages you need to be thinking about are more numerous (and less well known) than you suspect.

– Tyler Schnoebelen


Opinionated tweets

28 May

Luo, Osborne and Wang make the following data set available:

https://sourceforge.net/projects/ortwitter/

They crawled 30 million English-language tweets and then had 7 people use a search engine to call up results. The results showed 100 tweets and the people had to classify each of the 100 for whether it was (a) opinionated about the query, or (b) not opinionated.

There were 50 queries resulting in 5,000 annotated tweets.

Read their paper here:

Opinion Retrieval in Twitter: http://homepages.inf.ed.ac.uk/miles/papers/icwsm12.pdf

GIF pronunciations and the CMU Pronouncing Dictionary

23 May

The CMU Pronouncing Dictionary offers us the chance to see how many ways there are to pronounce “g” in English. Should it be hard-g GIF or soft-g JIF? (There are 8+ pronunciations of “g”!)

http://idibon.com/gif-and-ways-to-say-g/

Crowdsourcing and corpus studies

23 May

One of the things you might want to use crowdsourcing for is to annotate or create corpora.

You can read about crowdsourcing techniques in linguistics in this paper:

Using crowdsourcing for linguistic research by Tyler Schnoebelen and Victor Kuperman

Or see a bunch of different linguistic research projects that used crowdsourcing (presented at the Linguistic Society of America’s annual conference):

LSA 2011 presentation

And you can read an analysis of using crowdsourcing to help assess damage from Hurricane Sandy here:

http://idibon.com/crowdsourced-hurricane-sandy-response/

GIF: The jood, the bad, and ujly

22 May

Originally written with Rob Munro for Idibon.com, thanks, Rob!


“Id” “ee” “bon”

Pronunciation matters. But to no one so much as Steve Wilhite, it seems. The inventor of the GIF graphics format accepted a lifetime achievement award at yesterday’s Webbys by flashing this on the screen:

“It’s pronounced JIF, not GIF.”

The mismatch between sounds and symbols is famously complex and controversial.

To be fair to Wilhite, a purely empirical approach to guessing the name would have led language technologists like us to get it wrong. It would also have made it seem more complicated. From a quick lookup of the CMU Pronouncing Dictionary you’ll see there are not two but six pronunciations for words starting with “g”:

  • /g/: the hard “g” like glide or galore (n=4,956)
  • /jh/: the soft “j” like gelatin or Gemini (n=686)
  • /zh/: fricatives like genre or Giselle (n=32)
  • /n/: sort of skip the “g” in favor of an /n/ like gnarly and gnash (n=28)
  • /hh/: softer still, like Geraldo, Geraldi (n=2)
  • /k/: the unvoiced alternate to “g”, as in Ghadafi (n=1)

So if you were just running the odds, then you’d bet g-if was over seven times more likely than jh-if (86.9% vs. 12.0%). If we just look for words that start off with “gi” it is closer but still favors g-if (58.4% vs. 38.4%), with jh-if mostly from Italian names (Giacomo, Giovanni) and giraffes.
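For anyone who wants to run the odds themselves, here is a rough Python sketch, not the script we actually used. It assumes dictionary lines laid out the way the CMU Pronouncing Dictionary lays them out (a word, whitespace, then ARPAbet phonemes, with ";;;" marking comments). The function names `first_phone_counts` and `shares` are made up for the example, and the four sample lines are hand-typed in that layout rather than copied from the file.

```python
from collections import Counter


def first_phone_counts(lines, letter="g"):
    """Tally the first phoneme of each entry whose spelling starts with `letter`."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phones = line.split()
        if word.lower().startswith(letter) and phones:
            # Vowels carry a stress digit (IH1, AH0); drop it so IH1 and IH0 match.
            counts[phones[0].rstrip("012")] += 1
    return counts


def shares(counts):
    """Turn raw counts into proportions of the total."""
    total = sum(counts.values())
    return {phone: n / total for phone, n in counts.items()}


# Hand-typed sample lines (an assumption about the layout, not real file contents).
sample = [
    "GLIDE  G L AY1 D",
    "GELATIN  JH EH1 L AH0 T AH0 N",
    "GENRE  ZH AA1 N R AH0",
    "GNASH  N AE1 SH",
]
print(first_phone_counts(sample))

# The word-initial counts quoted in the list above.
post_counts = {"g": 4956, "jh": 686, "zh": 32, "n": 28, "hh": 2, "k": 1}
post_shares = shares(post_counts)
print(f"{post_shares['g']:.1%} vs. {post_shares['jh']:.1%}")  # 86.9% vs. 12.0%
print(round(post_counts["g"] / post_counts["jh"], 1))         # 7.2
```

Pointing `first_phone_counts` at the full dictionary file should give counts in the same ballpark as the list, though the exact numbers will depend on which release of the dictionary you load.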

And that was just word-initial “g”.

Inside words, it gets more complex (switching to the International Phonetic Alphabet):

  • ŋ: sing, complaining, English
  • f: tough, enough, roughneck
  • aɪ: height, alight, align
  • ɔ: daughter, afterthought, naught
  • eɪ: featherweight, campaign, sleigh
  • aʊ: bough, drought, plough
  • ju: Hugh, impugn
  • oʊ: although, cologne, furlough

And there are a few real outliers:

  • foreign
  • diaphragm
  • lasagna
  • imbroglio
  • O’Donoghue
  • Voigt
  • Nguyen

We pulled these last seven out of the corpus easily enough, but in the interest of time (mainly our own) we’ll leave the phonological analysis to other people. We also skipped some ough variations, because we couldn’t do it better than Dr. Seuss:

The Tough Coughs As He Ploughs the Dough

Just try to hold “ough” constant.

Why sound-symbol mismatch is not all bad

If nothing else, these examples should have shown why speech recognition is a non-trivial task. Full respect to the people who have made technologies like Siri a reality—it builds on decades of work.

However, the lack of a one-to-one mapping between sounds and symbols is not always a bad thing. In some contexts, it’s unambiguous (you wouldn’t turn good into ‘jood’ or ugly into ‘ujly’ if you had a simple dictionary and/or knew the context).

In other cases, the mismatch is downright helpful. The English plural is typically the /z/ sound (‘dogz’, ‘tablez’, ‘carz’) and only /s/ in certain contexts (‘cats’, ‘lamps’, ‘bikes’). By standardizing the spelling, even when the pronunciation changes, it’s easier for both humans and machines to understand the written text.
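To make the plural pattern concrete, here is a small Python sketch of how the sound of the regular plural follows from the last sound of the stem. It is a simplification under stated assumptions: phonemes are written in ARPAbet without stress digits, the function name `plural_sound` is made up for the example, and it includes a third variant that this post doesn’t cover (the extra syllable in words like ‘buses’ and ‘dishes’).

```python
# Stem-final sounds that trigger an extra syllable (buses, dishes, judges).
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}
# Other voiceless consonants, after which the plural is /s/.
VOICELESS = {"P", "T", "K", "F", "TH"}


def plural_sound(last_phone):
    """Return the sound of the regular plural after a stem ending in `last_phone`."""
    if last_phone in SIBILANTS:
        return "IH Z"  # not discussed in the post; added for completeness
    if last_phone in VOICELESS:
        return "S"
    return "Z"  # vowels and voiced consonants: the typical case


# The post's examples, paired with the last sound of each stem.
examples = [("dog", "G"), ("table", "L"), ("car", "R"),
            ("cat", "T"), ("lamp", "P"), ("bike", "K")]
for word, last in examples:
    print(f"{word}s -> {plural_sound(last)}")
```

Three sounds, one spelling: the written ‘s’ stays put while the pronunciation shifts with its neighbor.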

It’s the same for verbs. When you add ‘-ing’ to ‘sigh’, you are pronouncing it ‘sigh-ying’, adding the glide ‘y’ because of English’s preference against adjacent vowels (which also looks like another ‘g’ pronunciation). That doesn’t really give us any more information about the meaning of the word, so it’s simpler to write and read in a way that happens to be more consistent than the way we actually speak.

– Tyler Schnoebelen and Rob Munro, your #1 g’s


Corpus linguistics and the NBA playoffs

21 May

In honor of the NBA Draft Lottery, some facts about the vagaries of three synonymous-looking terms: basketball, hoops, and bball.

http://idibon.com/bball-and-hoops-when-do-synonyms-matter/

Are basketball, bball, and hoops really synonyms? From http://idibon.com/bball-and-hoops-when-do-synonyms-matter/

 

 

Discovering linguistic diversity

20 May

Over at the Idibon blog, there are a couple of posts that talk about how languages do stuff.

First, some of our favorite things about indigenous languages of the US and Canada:

http://idibon.com/powwow-5-facts-about-native-american-languages/

(The origin of “powwow” and Havasupai pronoun fun, Cherokee verbs and more.)

And using a corpus of movie subtitles, an analysis of a single line from film noir in French, Hungarian, and Turkish.

http://idibon.com/the-multilingual-falcon/

The Maltese Falcon, analyzed in French, Hungarian, and Turkish at http://idibon.com/the-multilingual-falcon/