The weirdest languages

21 Jun

Originally published over on Idibon.com

We’re in the business of natural language processing with lots of different languages. So far we’ve worked on (big breath): English, Portuguese (Brazilian and from Portugal), Spanish, Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek, Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian, Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian, Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese, Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish, Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano, Danish, and Navajo.

Natural language processing (NLP) is about finding patterns in language: for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it’s very English-centric. English is far and away the language that linguists have worked on the most and it’s also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). So one of the best ways to test an NLP system is to try languages other than English. The better a system can deal with diverse data, the more confident you can be in its ability to handle unseen data.

To this end, we might choose to define “weirdness” in terms of English. But that’s a pretty irritating definition. Let’s try to do something different.

A global method for linguistic outliers

The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total.

So rather than take an English-centric view of the world, WALS allows us to take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature. For example, English word order is subject-verb-object. There are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7% of languages start with a verb (like Welsh, Hawaiian and Majang), so cross-linguistically, starting with a verb is unusual. For what it’s worth, 41.0% of these languages are actually SOV order. (Aside: I’ve done some work with Hawaiian and Majang and that’s how I learned that verbs are a big commitment for me. I’m just not ready for verbs when I open my mouth.)
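To make the counting concrete, here is a minimal Python sketch of the relative-frequency step. The function name `value_frequencies` and the seven-language sample are made up for illustration. They are not the WALS data, so the shares it prints won’t match the 35.5% and 41.0% above.

```python
from collections import Counter


def value_frequencies(codings):
    """Return the share of coded languages that have each feature value."""
    counts = Counter(codings.values())
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Hypothetical toy sample (NOT the 1,377 languages coded in WALS).
toy_word_order = {
    "English": "SVO",
    "Mandarin": "SVO",
    "Hindi": "SOV",
    "Japanese": "SOV",
    "Turkish": "SOV",
    "Welsh": "VSO",
    "Hawaiian": "VSO",
}

if __name__ == "__main__":
    # Print each word order with its share of the toy sample.
    for value, share in sorted(value_frequencies(toy_word_order).items()):
        print(f"{value}: {share:.1%}")
```

A language’s value is unusual when its share is small, which is the intuition the rest of the method builds on.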

The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).
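The two cutoffs can be sketched in a few lines of Python. This assumes the data sits in a nested dictionary (language name, then feature name, then value); that layout and the name `filter_sparse` are my own placeholders, not anything WALS provides. Only the thresholds of 100 and 10 come from the text.

```python
def filter_sparse(data, min_langs=100, min_feats=10):
    """Drop thinly coded features first, then thinly coded languages.

    data: {language: {feature: value}} (assumed layout, for illustration).
    """
    # Count how many languages are coded for each feature.
    coverage = {}
    for feats in data.values():
        for feat in feats:
            coverage[feat] = coverage.get(feat, 0) + 1

    # Keep only features coded for at least `min_langs` languages.
    kept_features = {f for f, n in coverage.items() if n >= min_langs}

    # Keep only languages with at least `min_feats` of the kept features.
    filtered = {}
    for lang, feats in data.items():
        kept = {f: v for f, v in feats.items() if f in kept_features}
        if len(kept) >= min_feats:
            filtered[lang] = kept
    return filtered
```

The order matters: a language is judged on how many of the surviving features it has, not on how many features it has overall.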

Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
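Here is one way the pruning could look in Python. The post doesn’t say which correlation measure or cutoff was used, so this sketch swaps in Cramér’s V (a standard association measure for categorical data) with a cutoff of 0.2, and a simple greedy pass. All three are assumptions, as are the function names and the nested-dictionary layout.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of category labels."""
    n = len(xs)
    rows, cols = Counter(xs), Counter(ys)
    k = min(len(rows), len(cols)) - 1
    if n == 0 or k == 0:
        return 0.0  # no data, or a constant feature: treat as unassociated
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def association(data, a, b):
    """Cramér's V over the languages that are coded for both features."""
    pairs = [(f[a], f[b]) for f in data.values() if a in f and b in f]
    if not pairs:
        return 0.0
    xs, ys = zip(*pairs)
    return cramers_v(xs, ys)


def prune_correlated(data, threshold=0.2):
    """Greedily keep features, preferring those coded for more languages.

    data: {language: {feature: value}} (assumed layout).
    threshold: assumed cutoff; the post does not state one.
    """
    coverage = Counter(feat for feats in data.values() for feat in feats)
    kept = []
    # Visit well-covered features first so they win against correlated rivals.
    for feat in sorted(coverage, key=lambda f: (-coverage[f], f)):
        if all(association(data, feat, other) < threshold for other in kept):
            kept.append(feat)
    return kept
```

Visiting features from most to least coded is what implements “pick the one that has more languages coded for it”: by the time a sparser feature comes up, its better-covered rival is already in the kept list.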

For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included the subject/object/verb order feature then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). The Weirdness Index is then an average across the 21 unique structural features. But because different features have different numbers of values and we want to reduce skewing, we actually take the harmonic mean (and because we want bigger numbers = more weird, we actually subtract the mean from one). In this blog post, I’ll only report languages that have a value filled in for at least two-thirds of features (239 languages).
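A stripped-down version of the index might look like the Python below. It leaves out the entropy normalization, since the exact formula isn’t given here, so its numbers won’t reproduce the published values. The function names, the nested-dictionary layout, and the choice to treat a harmonic mean containing a zero as zero are all assumptions on my part.

```python
def harmonic_mean(values):
    """Harmonic mean; by convention (assumed here) it is 0 if any value is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1 / v for v in values)


def weirdness_index(lang, data):
    """One minus the harmonic mean of how common each of lang's values is.

    data: {language: {feature: value}} (assumed layout).
    Entropy normalization from the post is deliberately omitted.
    """
    freqs = []
    for feat, value in data[lang].items():
        # Values of this feature among the OTHER languages coded for it.
        others = [
            feats[feat]
            for name, feats in data.items()
            if name != lang and feat in feats
        ]
        if others:
            freqs.append(others.count(value) / len(others))
    if not freqs:
        raise ValueError(f"no comparable features for {lang!r}")
    return 1 - harmonic_mean(freqs)
```

The harmonic mean is pulled down hard by small values, so a single very rare feature value raises the index more than it would under an ordinary average.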

The outlier (weirdest) languages

The language that is most different from the majority of all other languages in the world is a verb-initial tonal language spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets (that’s where we get the word parka from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.

But here’s the rub: some of the weirdest languages in the world are ones you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually English is #33 in the Language Weirdness Index.

The 25 weirdest languages of the world. In North America: Chalcatongo Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, and Zoque; in South America: Paumarí and Trumai; in Australia/Oceania: Pitjantjatjara and Lavukaleve; in Africa: Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and Khoekhoe; in Asia: Nenets, Eastern Armenian, Abkhaz, Ladakhi, and Mandarin; and in Europe: German, Dutch, Norwegian, Czech, and Spanish.

By the way, how awesome of a name is “Pitjantjatjara”? (Also: can you guess which one of the internal syllables is silent?)

Questions and pronouns: two example features

This is odd. Is this odd? One of the features that distinguishes languages is how they ask yes/no questions. Most languages have a special question particle that they tack on somewhere (like the ka at the end of a Japanese question). Of 954 languages coded for this in WALS, 584 of them have question particles. The word order switching that we do in English only happens in 1.4% of the languages. That’s 13 languages total and most of them come from Europe: German, Czech, Dutch, Swedish, Norwegian, Frisian, English, Danish, and Spanish.
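If you want to check the arithmetic, the counts in this paragraph turn into shares like so. The counts are the ones quoted above; the helper name `share` is just a placeholder.

```python
CODED = 954            # languages coded for yes/no question marking
WITH_PARTICLE = 584    # languages that use a question particle
WORD_ORDER = 13        # languages that switch word order, as English does


def share(count, total):
    """Fraction of the coded languages that use a given strategy."""
    return count / total


print(f"question particle: {share(WITH_PARTICLE, CODED):.1%}")  # 61.2%
print(f"word order switch: {share(WORD_ORDER, CODED):.1%}")     # 1.4%
```

So question particles cover roughly three in five of the coded languages, while the English strategy covers about one in seventy.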

But there is an even more unusual way to deal with yes/no questions and that’s what Chalcatongo Mixtec does: nothing at all. It is the only language surveyed that does not have a particle, a change of word order, a change of intonation… There is absolutely no difference between an interrogative yes/no question and a simple statement. I have spent part of the day imagining a game show in this language.

Another thing languages have to deal with is what to do with simple subjects like I, they, or it. These are called pronominal subjects (something like The minister prevaricated has a nominal subject). The most common way to do this is to just tack the information about the subject on to the verb—437 out of 711 languages do this, like Spanish, Italian, and Portuguese. But Dutch, German, and Norwegian—like English—prefer having special subject pronouns that are normally/obligatorily present. But this is only done by 82 of the 711 languages coded in WALS. Kutenai (100 speakers in British Columbia, Canada) and Mumuye (400,000 speakers in Nigeria) do something even more unusual: they have something like subject pronouns but these go in different positions in the syntax than where full noun phrases go. And even more unusual than this is Chalcatongo Mixtec again: they combine several strategies so they have both subject markers that they add to verbs and they have pronoun words, too. But these pronoun words appear in a different spot from where a full noun phrase would show up.

The 5 least weird languages in the world

Now if I asked you to consider these languages, how weird would you say they were? Lithuanian, Indonesian, Turkish, Basque, and Cantonese. Surprise! They are really low on the Weirdness Index. They don’t seem typical to linguists and language learners but for these 21 features they stick with the crowd. Notice that we get isolates (like Basque) distributed throughout levels of Weirdness. Basque is “typical” but Kutenai, another isolate, is one of the weirdest of all languages. Even more surprising is that Mandarin Chinese is in the top 25 weirdest and Cantonese is in the bottom 10. This has to do with the fact that they have different sounds: Mandarin, unlike Cantonese, has uvular continuants and has some limits on “velar nasals” (like English, Mandarin can have a sound like the one at the end of song but it can’t have that sound at the beginning of words; worldwide it’s rare to have that particular restriction).

At the very very bottom of the Weirdness Index there are two languages you’ve heard of and three you may not have: Hungarian, normally renowned as a linguistic oddball, comes out as totally typical on these dimensions. (I got to live in Budapest last summer and I swear that Hungarian does have weirdnesses, it just hides them in other places.) Chamorro (a language of Guam spoken by 95,000 people), Ainu (just a handful of speakers left in Japan, it is nearly extinct), and Purépecha (55,000 speakers, mostly in Mexico) are all very normal. But the very most super-typical, non-deviant language of them all, with a Weirdness Index of only 0.087, is Hindi, which has only a single weird feature.

Part of this is to say that some of the languages you take for granted as being normal (like English, Spanish, or German) consistently do things differently than most of the other languages in the world. It reminds me of one of the basic questions in psychology: to what extent can we generalize from research studies based on university students who are, as Joseph Henrich and his colleagues argue, Western, Educated, Industrialized, Rich, and Democratic? In other words: sometimes the input is WEIRD and you need to ask yourself how that changes things.

You’re weird

Even though the methods here don’t define things in terms of English, they still smuggle in some cultural-specificity. That is, the linguists who developed and annotated the features were mostly speakers of European languages. What features might a person from Papua New Guinea or Ethiopia or the Amazon have come up with instead? And of course, WALS doesn’t have any data at all on about 4,000 languages. And the languages that it has the most data for are not truly random.

Despite this, English still ranks as highly unusual (it comes in as #33 with an index value of 0.756). That English-speaking brain you’ve been using to read this? It’s wired weird.

– Tyler Schnoebelen (@TSchnoebelen)

Appendix: The tops and bottoms

Here are the values for the top and bottom 10 languages.

Rank  Language                  Weirdness Index
1     Mixtec (Chalcatongo)      0.972
2     Nenets                    0.935
3     Choctaw                   0.924
4     Diegueño (Mesa Grande)    0.920
5     Oromo (Harar)             0.919
6     Kutenai                   0.908
7     Iraqw                     0.900
8     Kongo                     0.883
9     Armenian (Eastern)        0.861
10    German                    0.858
230   Basque                    0.189
231   Bororo                    0.153
232   Quechua (Imbabura)        0.151
233   Usan                      0.151
234   Cantonese                 0.143
235   Hungarian                 0.132
236   Chamorro                  0.128
237   Ainu                      0.128
238   Purépecha                 0.100
239   Hindi                     0.087

Update: Here is the full list, with the 21 weirdness features and all of the languages that had values for at least one of them (don’t trust those values, of course).

Weirdness_index_values_full_list

 
