May | 2013 | Corpus linguistics

Archive | May, 2013

Economic powerhouse languages


What are the languages that are shaping the world’s economy? Or in other words: “Do I really need to know more than English and if so, what?” The answer is going to be yes. So then I’ll ask which languages are in the best and worst position for natural language processing. Where do we need to go to work? (Theoretically, this post could also help you decide about what languages to learn but language learning works much better if you have passion and/or need.)

As a first swipe, let’s take a look at languages that are used in countries that have a 2012 GDP of at least $100 billion and have a 2011-2012 growth rate of at least 5%.

Asia: China, India, Indonesia, Saudi Arabia, Thailand, Qatar, Kazakhstan, Kuwait, Vietnam, Bangladesh, Iraq
Americas: Venezuela, Chile, Peru
Africa: Nigeria, Angola, Libya

One of the first things that you should notice is that English is only minimally represented. English is probably most useful in India and Nigeria. This isn’t to say that English isn’t a major force in the world: of course it is. For what it’s worth, there are about 335m English speakers worldwide, about 430m if we include non-native speakers.

So what are the languages that go with the places above?

Mandarin Chinese: China, Taiwan, Malaysia, Singapore; 848m speakers worldwide, 70% of speakers in China know it as a first language. There are other Chinese languages, of course. Consider powerhouse cities like Chengdu and Chongqing which have Sichuanese, or Foshan, Guangzhou, Hong Kong and Shenzhen where Cantonese is big; or Hangzhou and Shanghai with their Wu dialects. There are well over a million Chinese speakers in Thailand: but mostly Min Nan, not Mandarin. Hm. I’m going to need to do a whole post on Chinese, aren’t I?
Spanish: Venezuela, Chile, Peru; Spain isn’t doing so well but Barcelona and Madrid are still among the richest cities in the world. Worldwide there are about 406m Spanish speakers.
Hindi: India has 258m speakers. India is also a land of a tremendous amount of linguistic diversity. Telugu, Marathi, Tamil, Urdu, Gujarati, and Kannada each have over 35m speakers.
Standard Arabic: Saudi Arabia, Qatar, Kuwait, Iraq, Libya. Fwiw, no one really speaks Standard Arabic and each of these countries essentially has its own Arabic dialect. Worldwide there are 206m people who speak some Arabic variety as a first language.
Portuguese: The official language of Angola, as well as of Brazil and of course of Portugal, whose capital Lisbon is one of the richest cities in the world. There are 202m speakers of Portuguese worldwide. Back in Angola, other major languages include Umbundu (6m) and Kimbundu (4m).
Bengali: 110m speakers in Bangladesh, 82.5m in India.
Vietnamese: There are 66m in Vietnam.
Malay-Indonesian: Indonesia and Thailand (and obviously Malaysia, but Malaysia doesn’t have the GDP growth to qualify here). It’s tricky to decide what exactly to count—official forms, which distinguish Standard Malay from Indonesian, or something else. Let’s just call it 40m speakers and know that it’s probably a low-ball estimate.
Thai: 20.2m speakers in Thailand.
Nigeria has 522 living languages. English is the national language, but various regions are dominated by Hausa (18.5m speakers in country), Igbo (18m), Yoruba (18.9m); there are also a lot of Nigerian Fulfulde speakers (11.5m); note that Nigerian Pidgin is spoken by 30m people. You really need to listen to it.
Kazakh: 5.3m in Kazakhstan, 1.3m in China.
Central Kurdish: 3.5m in Iraq, another 3.25m in Iran, which is also doing fairly well.

Let’s extend our net a little further. We’ll consider languages spoken in cities that are among the richest and the fastest growing (in terms of GDP). We’ll also consider the languages of countries that have at least $30b in 2012 GDP and at least a 3% growth in that number since 2011. Furthermore, we’ll restrict ourselves to languages that, in our areas of interest, have at least 3m speakers. That gives us 107 languages in 71 countries (recall that there are about 7,000 languages in the world today).

How NLPable are these languages?

Wikipedia offers a handy proxy for measuring how NLPable a language is: the more pages a language has in Wikipedia, the easier it is likely to be to get started working on the language. Of our 107 Economic Powerhouse languages, 27 have 100,000 or more pages in Wikipedia (as you might guess, European languages dominate here). Another 20 languages have 10,000-100,000 pages. 24 languages have fewer than 10,000 Wikipedia pages. Before I tell you how many have none, I need to exclude the 8 spoken varieties of Arabic in our data set because it’s conventional to write in Standard Arabic (and there are 225,000 pages in Standard Arabic). We might also remove varieties of Thai, Italian, and German (9 total). After doing that, there are still 19 of the 107 Powerhouse Languages without any Wikipedia pages at all (that’s 18%).
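
A quick sanity check of that arithmetic, as a sketch: the counts are the ones quoted in the paragraph above, while the dictionary labels are just informal shorthand and not categories from the underlying data set.

```python
# Counts quoted in the paragraph above; the keys are informal labels only.
buckets = {
    "100,000+ pages": 27,
    "10,000-100,000 pages": 20,
    "under 10,000 pages": 24,
    "spoken Arabic varieties (set aside)": 8,
    "varieties of Thai, Italian, German (set aside)": 9,
    "no Wikipedia pages": 19,
}

# The six buckets should account for every Powerhouse Language.
total = sum(buckets.values())
share_without_pages = buckets["no Wikipedia pages"] / total

print(total)
print(f"{share_without_pages:.0%}")
```

The buckets add up to the full set of Powerhouse Languages, and the share with no pages rounds to the 18% quoted above.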

For reference, the highest Wikipedia pages per speaker ratios can be found for European languages (Swedish, Dutch, Norwegian, Danish, Czech, Polish, Hungarian, French, German, Italian) and some Asian ones (Kazakh, Hebrew, Tagalog, Cebuano, Malay). There’s terrible representation for African languages (Sesotho, Tswana, Tigrinya, Zulu, Igbo, Oromo, Xhosa, Hausa, Fula, Kanuri) and some South Asian languages (Bengali, Oriya, Punjabi, Sindhi).

It’s much easier to do NLP when there are ample resources already digitized (and even better if they are collected and organized). Some resources/references on languages and NLP include LREC and IJCNLP.

We can also see what kind of research support there is for various languages by going to Google Scholar and searching for the language name plus “nlp” (natural language processing). We can then compare search results to total number of speakers, to pages on Wikipedia, and to the number of results for “nlp” alone (156,000). Here I’ll restrict myself to just the Powerhouse Languages that have at least 10,000 Wikipedia pages. Looking among these languages, it is obvious that English is in the top position. By a lot. Relatively speaking, German, French, Japanese, Italian, Czech, Greek, Norwegian, Korean, Thai, Danish, and Hebrew are among the best researched. The languages in the worst shape are Vietnamese, Malayalam, Azerbaijani, Kazakh, Tagalog, Belarusian, Gujarati, Uzbek, Kurdish, Yoruba, Cebuano, and Javanese.

The main take-away is that if you are doing work with a global perspective and you’re only paying attention to English you are not alone. But you are missing enormous opportunities. Depending upon what you’re trying to do, some of these languages will be more interesting than others. While it’s true that the number of G8 languages is relatively small and gets you pretty far, it’s obvious that most of the world’s communication is happening in a much more diverse set of languages. If you think globally, then it’s probably a pretty safe bet that whether you care about what people are speaking at the office or at home, the languages you need to be thinking about are more numerous (and less well known) than you suspect.

– Tyler Schnoebelen

Opinionated tweets

28 May

Luo, Osborne and Wang make the following data set available:

https://sourceforge.net/projects/ortwitter/

They crawled 30 million English-language tweets and then had 7 people use a search engine to call up results. The results showed 100 tweets and the people had to classify each of the 100 for whether it was (a) opinionated about the query, or (b) not opinionated.

There were 50 queries resulting in 5,000 annotated tweets.

Read their paper here:

Opinion Retrieval in Twitter: http://homepages.inf.ed.ac.uk/miles/papers/icwsm12.pdf

GIF pronunciations and the CMU Pronouncing Dictionary

23 May

The CMU Pronouncing Dictionary offers us the chance to see how many ways there are to pronounce “g” in English. Should it be hard-g GIF or soft-g JIF? (There are 8+ pronunciations of “g”!)

http://idibon.com/gif-and-ways-to-say-g/

Crowdsourcing and corpus studies

23 May

One of the things you might want to use crowdsourcing for is to annotate or create corpora.

You can read about crowdsourcing techniques in linguistics in this paper:

Using crowdsourcing for linguistic research by Tyler Schnoebelen and Victor Kuperman

Or see a bunch of different linguistic research projects that used crowdsourcing (presented at the Linguistic Society of America’s annual conference):

LSA 2011 presentation

And you can read an analysis of using crowdsourcing to help assess damage from Hurricane Sandy here:

http://idibon.com/crowdsourced-hurricane-sandy-response/

GIF: The jood, the bad, and ujly

22 May

Originally written with Rob Munro for Idibon.com, thanks, Rob!

Pronunciation matters. Including to Steve Wilhite, it seems. The inventor of the GIF graphics format accepted a lifetime achievement award at yesterday’s Webby Awards by flashing this on the screen:

“It’s pronounced JIF, not GIF.”

The mismatch between sounds and symbols is famously complex and controversial.

To be fair to Wilhite, a purely empirical approach to guessing the name would have led language technologists like us to get it wrong. It would also have made it seem more complicated. From a quick lookup of the CMU Pronouncing Dictionary you’ll see there are not two but six pronunciations for words starting with “g”:

/g/: the hard “g” like glide or galore (n=4,956)
/jh/: the soft “j” like gelatin or Gemini (n=686)
/zh/: fricatives like genre or Giselle (n=32)
/n/: sort of skip the “g” in favor of an /n/ like gnarly and gnash (n=28)
/hh/: softer still, like Geraldo, Geraldi (n=2)
/k/: the unvoiced alternate to “g”, as in Ghadafi (n=1)

So if you were just running the odds, then you’d bet g-if was over seven times more likely than jh-if (86.9% vs. 12.0%). If we just look at words that start with “gi”, it is closer but still favors g-if (58.4% vs. 38.4%), with jh-if mostly coming from Italian names (Giacomo, Giovanni) and giraffes.
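
For anyone who wants to reproduce this kind of count, here is a minimal sketch. It assumes the dictionary's plain-text layout (one entry per line, the word followed by ARPAbet phonemes, comment lines starting with `;;;`). The sample entries are illustrative rather than a verbatim excerpt, and the function name `first_phoneme_counts` is made up for this example. The second half simply turns the counts quoted in the list above into percentages.

```python
from collections import Counter

def first_phoneme_counts(lines, prefix="G"):
    """Count the first phoneme of every entry whose spelling starts with `prefix`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phonemes = line.split()
        if word.upper().startswith(prefix) and phonemes:
            # Drop stress digits (e.g. "AY1" -> "AY") so variants group together.
            counts[phonemes[0].rstrip("012")] += 1
    return counts

# Illustrative entries written in the dictionary's layout (not a verbatim excerpt).
sample = [
    ";;; illustrative sample",
    "GLIDE  G L AY1 D",
    "GALORE  G AH0 L AO1 R",
    "GELATIN  JH EH1 L AH0 T AH0 N",
    "GENRE  ZH AA1 N R AH0",
    "GNASH  N AE1 SH",
    "TOUGH  T AH1 F",
]
sample_counts = first_phoneme_counts(sample)
print(sample_counts)

# Counts quoted in the list above, turned into shares of all g-initial words.
quoted = {"g": 4956, "jh": 686, "zh": 32, "n": 28, "hh": 2, "k": 1}
total = sum(quoted.values())
shares = {phoneme: n / total for phoneme, n in quoted.items()}
print(f"{shares['g']:.1%} vs. {shares['jh']:.1%}")
```

Note that this sketch counts alternate pronunciations (entries written like `WORD(1)`) as separate entries. Whether the figures above did the same is not stated.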

And that was just word-initial “g”.

Inside words, it gets more complex (switching to the International Phonetic Alphabet):

ŋ: sing, complaining, English
f: tough, enough, roughneck
aɪ: height, alight, align
ɔ: daughter, afterthought, naught
eɪ: featherweight, campaign, sleigh
aʊ: bough, drought, plough
ju: Hugh, impugn
oʊ: although, cologne, furlough

And there are a few real outliers:

foreign
diaphragm
lasagna
imbroglio
O’Donoghue
Voigt
Nguyen

We pulled these outliers out of the corpus easily enough, but in the interest of time (mainly our own) we’ll leave the phonological analysis to other people. We also skipped some ough variations, because we couldn’t do it better than Dr. Seuss:

Just try to hold “ough” constant.

Why sound-symbol mismatch is not all bad

If nothing else, these examples should have shown why speech recognition is a non-trivial task. Full respect to the people who have made technologies like Siri a reality—it builds on decades of work.

However, the lack of a one-to-one mapping between sounds and symbols is not always a bad thing. In some contexts, it’s unambiguous (you wouldn’t turn good into ‘jood’ or ugly into ‘ujly’ if you had a simple dictionary and/or knew the context).

In other cases, the mismatch is downright helpful. The English plural is typically the /z/ sound (‘dogz’, ‘tablez’, ‘carz’) and only /s/ in certain contexts (‘cats’, ‘lamps’, ‘bikes’). By standardizing the spelling, even when the pronunciation changes, it’s easier for both humans and machines to understand the written text.
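
The plural pattern can be sketched as a small rule keyed on the noun's final sound. One caveat: the third variant below (an extra vowel after hissing sounds, as in "buses") is not discussed above and is added here only because the rule is incomplete without it. The phoneme labels are ARPAbet-style, and the set contents and the function name `plural_sound` are illustrative choices.

```python
# ARPAbet-style symbols; these sets are a simplified illustration.
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}
VOICELESS = {"P", "T", "K", "F", "TH"}

def plural_sound(final_phoneme):
    """Return how the plural ending is pronounced after a given final sound."""
    if final_phoneme in SIBILANTS:
        return "IH Z"   # buses, dishes (ASSUMPTION: case not covered in the post)
    if final_phoneme in VOICELESS:
        return "S"      # cats, lamps, bikes
    return "Z"          # dogs, tables, cars (voiced consonants and vowels)

examples = [("cat", "T"), ("lamp", "P"), ("bike", "K"),
            ("dog", "G"), ("table", "L"), ("car", "R")]
for word, last_sound in examples:
    print(word, "->", plural_sound(last_sound))
```

Three pronunciations, one spelling: that is the standardization the paragraph above is pointing at.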

It’s the same for verbs. When you add ‘-ing’ to ‘sigh’, you pronounce it ‘sigh-ying’, adding the glide ‘y’ because of English’s preference against adjacent vowels (which also looks like another ‘g’ pronunciation). That doesn’t really give us any more information about the meaning of the word, so it’s simpler to write and read in a way that happens to be more consistent than the way we actually speak.

– Tyler Schnoebelen and Rob Munro, your #1 g’s

Corpus linguistics and the NBA playoffs

21 May

In honor of the NBA Draft Lottery, some facts about the vagaries of three synonymous-looking terms: basketball, hoops, and bball.

http://idibon.com/bball-and-hoops-when-do-synonyms-matter/

Discovering linguistic diversity

20 May

Over at the Idibon blog, a couple of posts talk about how languages do stuff.

First, some of our favorite things about indigenous languages of the US and Canada:

http://idibon.com/powwow-5-facts-about-native-american-languages/

(The origin of “powwow” and Havasupai pronoun fun, Cherokee verbs and more.)

And using a corpus of movie subtitles, an analysis of a single line from film noir in French, Hungarian, and Turkish.

http://idibon.com/the-multilingual-falcon/

Tags: Ahtna, Algonquian, Anishinaabemowin, Carrier, Cherokee, Eastern Abenaki, Eastern Ojibwa, French, Havasupai, Hungarian, Illinois, Inuit, Koyukon, Lakhota, Lakota, Massachusett, Narragansett, Powhatan, Tanana, Turkish, Virginia Algonquian, Yupik

Top pop songs corpus

18 May

Over on the Idibon blog, an analysis of 122 years of pop song hits. Focusing on love (and the loss of love from song titles in recent years, ack!)

http://idibon.com/weve-lost-that-lovin-feelin/

How the world communicates

17 May

Over on the Idibon blog, some info about all the ways the world communicates:

http://idibon.com/idibon-at-strata/

Three factoids from the post:

Every three months the amount of text messages sent equals all books ever published
If Facebook’s “Like” was a 1-word language, it’d be in the top 5% of the most widely spoken languages in the world
If we had recorded every word that every human had ever spoken in text, it would take up less than 1% of the world’s current digital storage capacity (about 50 exabytes, assuming 110B people have averaged 16,000 words a day for 20 years each…uh, of course these are bold assumptions!)
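
The back-of-the-envelope arithmetic behind that last factoid works out roughly as follows. The number of people, words per day, and years are the figures given above. The 4 bytes per word is an assumption added here (the post does not say what size it used), picked because it lands close to the quoted 50 exabytes.

```python
people = 110e9            # 110B people, per the factoid
words_per_day = 16_000
years_speaking = 20
bytes_per_word = 4        # ASSUMPTION: not stated in the original post

total_words = people * words_per_day * 365 * years_speaking
exabytes = total_words * bytes_per_word / 1e18  # 1 exabyte = 10**18 bytes

print(f"{total_words:.2e} words, about {exabytes:.0f} exabytes")
```

A larger per-word size (say, to allow for longer words or a wider character encoding) scales the result proportionally, so the estimate is only as good as that one assumption.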

The Multilingual Falcon

16 May

Every language has one. The kind of hot thing that rolls off a native tongue all sweet, but presses into your own ear jagged, curling your hair and making your skin itch. Some kind of clitic that ends a party, a string of morphemes that’s chin music. An optative or an elative come out and you wish you could get the hell out. Even meek language learners feel a savageness when the strangeness comes around.

How much is lost in translation when we try to process only in English? Perhaps 90% of academic and commercial Natural Language Processing has focused only on English. If you are trying to find broad topics this might not matter, but if you are trying to identify all the subtle (or not so subtle) metaphors, sentiment and emotion, translating into English will often strip away the very phenomena you are most interested in.

Translation	Literal meaning	Machine translation
L’étoffe dont sont faits les rêves.	“The material of which are made the dreams”.	“The stuff that dreams are made of”
Dolgok, amikről álmodunk.	“things, about which we dream”.	“Things I dream of.”
Rüyalarin yapildigi maddeden.	“dreams’ were-made material-of”.	“Material dreams are made.”
Translations for “The stuff that dreams are made of” in French, Hungarian and Turkish

In the examples above, how much of the full impact of “the stuff that dreams are made of” is lost in translation? Only the French machine translation turns it back into the correct English, but we suspect that this is because it knows the famous quote. Imagine the full range of expressions in English that would lose their punch when translated: “bare your heart”, “give up”, “beside yourself”. Then realize that every single one of the world’s languages has an equally rich set of expressions and idioms that cannot be adequately translated, by humans or machines.

This is why we need intelligent Natural Language Processing that works within each language, not just with translations: it is often the most emotionally charged expressions that cannot be translated.

For this post we’ll break down this one example, taken from the most famous line from The Maltese Falcon. We choose this among all possible idioms or expressions for reasons close to our hearts: a month or so ago we moved into our new offices five floors above where the author Dashiell Hammett worked as a private eye.

A police detective picks up the Maltese Falcon statue and notes how heavy it is. “What is it?” he asks Sam Spade.

The, uh, stuff that dreams are made of.

Let’s take the lid off and see the works. We’re going to use the translations that actually appear in subtitles, courtesy of OpenSubtitles via Jörg Tiedemann’s OPUS corpus. (None of them choose to translate the uh, which is a bit sad since it’s one of the stronger stylistic markers). The machine translations are from a well-known search engine.

French

We’ll start with a pretty easy one. French is a broadly spoken language and since it is related to other widely spoken languages like Spanish and Portuguese, odds are that it won’t be all that foreign to you.

L’étoffe dont sont faits les rêves.

This is something like “The material of which are made the dreams”.

The word étoffe in French means ‘material’. It’s a feminine noun, which you might guess from the final –e (though that’s not really a sure-fire indication). Normally, you could tell based on the article, but since the word starts with a vowel you turn la just to l’. (The idiom il manque d’étoffe means ‘he lacks personality’, btw.)

Gender systems are pretty common around the world, not just in Indo-European languages. For example, Bantu languages across Africa have lots of genders, often between 7 and 10. What gender means for language learners and computational linguists is that we have to pay attention to a noun’s classification in order to know how to do stuff with it (like pluralize it) and how to handle agreement with other words like adjectives and verbs. In general, the more genders a language has, the more word forms there are that correspond to what we might want to call “the same” word.

Let’s press on. The dont is a ‘relative pronoun’ that indicates possession, so it could be translated as ‘of which’, ‘from which’.

The verbal ‘are made’ meaning is found in sont faits. The first of those words is the third-person plural present tense for ‘to be’. Faits is from the verb faire, ‘to do’. They agree in plurality—if we were talking about the stuff that a dream was made of, we’d have est fait. In language-after-language, the verbs ‘to be’ and ‘to do’ are painfully irregular. Well, painful for the language learner. If you’re a native English speaker, when was the last time you said I am’ed or he do’ed? Frequency helps you learn (and it helps the form escape the grinding power of regularization).

Finally, les rêves are ‘the dreams’ (the singular is le rêve). That’s pretty straight-forward, so I won’t say anything more about it.

Hungarian

In Hungarian, the line is a bit more like “things, about which we dream”.

Dolgok, amikről álmodunk.

The word for ‘thing’ in Hungarian is dolog. But if you want to pluralize it, you don’t get to just add a letter at the end. Instead, you have to flip some stuff around: dolgok. There are some fun linguistic processes at work here, so let me know in the comments if you’re interested.

Ami is the way you say ‘which’ and the k in the middle is like the k at the end of dolgok, a marker of plurality. Now, about the ending: Hungarian has a nearly-limitless supply of affixes. You add ről to indicate ‘off’ or ‘about’. Check out this link to go have your mind boggled by the major noun cases: http://www.hungarianreference.com/Nouns/. (A “case” is basically one way that a language might keep track of which words are related to which other words in what kinds of ways—for example, a nominative case marker roughly means something is the subject of the sentence and an accusative roughly means something is the object of a sentence.)

Most of the case suffixes have two forms. That’s because Hungarian has what’s called “vowel harmony”. Vowel harmony is the phonetic equivalent of “don’t wear stripes with leopard prints”. It means that you need to make the vowel in a suffix match with the noun’s last vowel. But by “match”, I don’t mean “be identical to”. Open up your mouth and say a bunch of vowels a few times—you’ll notice that some of them happen in the front of your mouth and some of them in the back. That’s what matters in Hungarian. Other languages harmonize other things, sometimes at quite some distance (meaning that there are other consonants and vowels that may intervene in between the two things that depend upon each other).
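
As a rough sketch of what “match the last vowel” means in practice, here is a toy function that picks between the back form -ról and the front form -ről of the ‘about’ suffix. It follows the simplified last-vowel description above. Real Hungarian harmony has more wrinkles (for instance, words that mix back vowels with i/í/e/é do not always follow this shortcut), and the function name `about_suffix` is invented for this example.

```python
# Vowel classes by where they are made in the mouth (precomposed characters assumed).
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

def about_suffix(noun):
    """Attach -ról/-ről ('about') based on the noun's last vowel.

    Simplification: only the final vowel is inspected, as in the
    description above; this is not a full model of Hungarian harmony.
    """
    for ch in reversed(noun.lower()):
        if ch in BACK_VOWELS:
            return noun + "ról"
        if ch in FRONT_VOWELS:
            return noun + "ről"
    raise ValueError(f"no vowel found in {noun!r}")

for noun in ["dolog", "amik", "kert"]:
    print(about_suffix(noun))
```

With amik the last vowel is a front one, so the function produces amikről, the form that appears in the subtitle.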

The verb álmodik is ‘to have a dream’, but you have to conjugate it. The form álmodunk is for ‘we dream’…except that Hungarians like to mess with your mind so there are actually two different ways to say ‘we dream’. The –unk ending indicates that there’s no definite object that the verb is about. Otherwise, if you wanted to say we dreamed some particular dream, then you’d need to use the –juk ending.

Turkish

In Turkish, the line is something like “dreams’ were-made material-of”.

Rüyalarin yapildigi maddeden.

For this, I’ll break it down word by word:

Rüyalarin
- Rüya is ‘dream’
- –lar is the plural
- –in means that the dream owns something (‘defined genitive case’)
yapildigi
- yap is a root of the base verb (yapmak), ‘to make, to do’
- –il is the passive
- –di is the past tense
- –gi…oh, gi. I’m going to talk about gi in a moment.
maddeden
- madde is ‘material, substance’
- –den is ‘of’ (though it is also sometimes ‘to move away from, by, via’)

Okay, you know how you hear people using impact as a verb (it used to just be a noun)? Languages have all sorts of ways to change parts of speech. Sometimes you just take a word and leave it as-is (like impact), but other processes work, too (noun-ify is a verb from a noun, noun-y is an adjective from a noun, nouniness is a noun from an adjective from a noun).

In Turkish, the –gi turns a verb into an adjective. In this case, that lets it get tied to a noun. You can’t just use –gi willy-nilly, though. You can only use it with some conjugations. (Fwiw, if you drop the noun that the adjectivized verb is modifying, then you can use it as a noun instead and keep on appending affixes.)

Turkish is also a great example of why ‘keyword’ based Natural Language Processing is not sufficient in many languages, as most of the action is happening within the words, but we’ll leave more about suffixes and prefixes for another post.
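
To make the keyword point concrete, here is a toy sketch: an exact keyword search for rüya misses the subtitle, while peeling off a few suffixes finds it. The suffix list contains only the three noun endings from the breakdown above, and the function `strip_suffixes` is a made-up illustration. A real Turkish analyzer has to handle far more suffixes, vowel harmony, and ambiguity, and this one will happily over-strip words that merely happen to end in these letters.

```python
# Only the noun suffixes from the breakdown above; deliberately incomplete.
SUFFIXES = ("lar", "in", "den")

def strip_suffixes(word):
    """Peel known suffixes off the end of a word, longest match first."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            # Keep at least two characters so short words are not emptied out.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= 2:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem

subtitle = "Rüyalarin yapildigi maddeden"
tokens = subtitle.split()

exact_hit = "rüya" in [t.lower() for t in tokens]
stemmed_hit = "rüya" in [strip_suffixes(t) for t in tokens]
print(exact_hit, stemmed_hit)
```

The verb yapildigi passes through untouched here, which is itself a reminder of how much of the meaning a short suffix list leaves on the table.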

One of the reasons this Turkish translation is good is that it evokes the standard Turkish translation of Shakespeare. Part of what you might hear in Sam Spade’s line is from The Tempest: “Leave not a rack behind. We are such stuff / As dreams are made on; and our little life / Is rounded with a sleep”. In Turkish, the middle part is “ruyalarin yapildigi maddeden yapilmayiz biz…”, so the subtitle gets to evoke it for Turkish speakers, too.

Now that you’ve been vexed on your tongue and troubled in your brain, we’ll sign off. Go still your beating mind.

– Tyler Schnoebelen (@TSchnoebelen)

ps–Thanks very much to Bence Farkas and Ali Alpay for their help!

pps–The line we’ve worked on here is probably one of the most famous in film noir…but it actually doesn’t appear in Dashiell Hammett’s story.

Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx
Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]
African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]
COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]
What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]