June | 2013 | Corpus linguistics

Archive | June, 2013

People like John Wayne

In natural language processing (NLP), it is common for systems to ignore fragments longer than three words. This makes sense from a machine-learning point of view: the number of possible combinations of four or more words in any language is astronomically high, and tracking them quickly runs into memory and processing constraints.
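To get a feel for the blow-up, here is a back-of-the-envelope sketch. The vocabulary size of 50,000 is an assumption picked for illustration, not a measured figure, and `possible_ngrams` is a made-up helper name.

```python
def possible_ngrams(vocabulary_size, n):
    """Upper bound on the number of distinct word sequences of length n."""
    return vocabulary_size ** n


VOCABULARY_SIZE = 50_000  # assumed for illustration only

# Print the upper bound for fragments of one to four words.
for n in range(1, 5):
    print(n, f"{possible_ngrams(VOCABULARY_SIZE, n):,}")
```

Most of those sequences never occur in real text, but a system cannot know in advance which ones it can skip.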

But a fragment, like the title of this article, can be ambiguous:

People like John Wayne movies
People like John Wayne in movies
People like John Wayne act in movies

The main topic of the first sentence is ‘John Wayne movies’, the main topic of the second is just ‘John Wayne’, and the main topic of the third is ‘movies’: each is subtly focussed on a different part of the sentence. The first two examples above express positive sentiment; the third does not. For a system without a deeper understanding of the sentences, just a few extra words can make a big difference.

The third sentence uses ‘like’ to encode information in a way that the first two do not: John Wayne is a person. It is obvious to us, but maybe not to a computational knowledge base. By interpreting these kinds of sentences correctly, we can therefore extract information about the world, in addition to understanding the sentences themselves. This article is about this third type of example: People like John Wayne act in movies.

In computational linguistics, we call these kinds of relationships hyponymy: “states like California”, “companies like IBM”, “trees like oak”, etc. The second term (the hyponym) is a specific type or instance of the general form (see Marti Hearst’s seminal 1992 work).

“I went to the general store but they wouldn’t let me buy anything specific.” – Steven Wright

One of the ways to give someone a toehold in a conversation or a report is to give an example. People often have a hard time with abstractions, so examples help increase understanding. They offer other rhetorical benefits, too: you may want to use an example to introduce the particular thing that you want to make into the main topic. Maybe you want to talk about GE, the US Supreme Court, Havana, the Outback, Rafael Nadal, or Nelson Mandela.

I’ve been thinking about this since I saw the phrase “carriers like Lufthansa”. There’s a specific. Let’s go general. I give you the construction “X like Y” and tell you that “Y” is some kind of named entity. What kind of Y is most common: Location, Organization, or Person? Whether you’re doing sentiment analysis, opinion mining, question-and-answer matching, or any of a number of other natural language processing tasks, you want your system to be able to identify and distinguish these kinds of named entities at a minimum (e.g., “what am I doing sentiment analysis about?”). Part of building a good system is understanding the distributions in contexts like the one we’re talking about here.

For a first pass, let’s go to the Corpus of Contemporary American English, which is a great site for exploring these kinds of questions very quickly. I look for all “Noun like ProperNoun” constructions and grab all of the matches that have at least 3 occurrences. The number one most popular example is “states like California”. (Btw, often California is the only example state listed. Well may we wonder: how many other states *are* like California? When examples are provided, the most California-like states are Florida and Texas. Your objections are noted.)
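Outside of the COCA interface, a rough version of this search can be sketched with a regular expression. This is only an approximation under stated assumptions: a lowercase word ending in “s” stands in for the noun, a run of capitalized words stands in for the proper noun, and the sample sentences, `LIKE_PATTERN`, and `count_like_phrases` are all made up for illustration.

```python
import re
from collections import Counter

# Crude stand-in for "Noun like ProperNoun": a lowercase plural-looking word,
# the word "like", then one or more capitalized words.
LIKE_PATTERN = re.compile(r"\b([a-z]+s)\s+like\s+([A-Z]\w*(?:\s+[A-Z]\w*)*)")


def count_like_phrases(sentences, min_count=3):
    """Count 'X like Y' matches and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        for general, specific in LIKE_PATTERN.findall(sentence):
            counts[f"{general} like {specific}"] += 1
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


# Invented sentences, just to exercise the pattern.
sentences = [
    "Voters in states like California care about water.",
    "Big states like California set the trend.",
    "Other states like California followed.",
    "Some carriers like Lufthansa added routes.",
]
frequent = count_like_phrases(sentences)
print(frequent)
```

A pattern this simple cannot tell the “such as” reading of “like” from the verb, so something like “fans like John Wayne movies” would be a false match; that is exactly the ambiguity from the top of this post.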

This example illustrates the general theme: journalists love “states/cities/countries/places like Y”. 62% of all of the “X like Y” examples have Y as a location. (It’s 48% if you use “types” instead of “tokens”. Tokens let you count each occurrence of “states like California”; types say “nope, that’s just one instance, even if it occurs 87 times”.) The next thing to note, of course, is how often organizations/locations/people are mentioned throughout the corpus. I haven’t relativized the COCA numbers below but in general, we see organizations get the most mentions. (Alas, organizations are also the hardest of these three labels to get right.)
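The token/type distinction is easy to see in a few lines of code. The phrases and counts here are invented so the arithmetic is easy to follow; they are not the COCA figures, and `label_shares` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical matches: (phrase, entity label of Y). Counts are made up.
matches = (
    [("states like California", "LOCATION")] * 4
    + [("companies like IBM", "ORGANIZATION")] * 2
    + [("people like John Wayne", "PERSON")]
    + [("cities like Havana", "LOCATION")]
)


def label_shares(items):
    """Share of items carrying each entity label."""
    counts = Counter(label for _, label in items)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


token_shares = label_shares(matches)      # every occurrence counts
type_shares = label_shares(set(matches))  # each distinct phrase counts once

print(token_shares["LOCATION"], type_shares["LOCATION"])
```

In this toy data, locations make up 5 of 8 tokens but only 2 of 4 types, because one very frequent phrase inflates the token count.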

– Tyler Schnoebelen (@TSchnoebelen)


The weirdest languages

21 Jun

Originally published over on Idibon.com

We’re in the business of natural language processing with lots of different languages. So far we’ve worked on (big breath): English, Portuguese (Brazilian and from Portugal), Spanish, Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek, Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian, Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian, Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese, Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish, Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano, Danish, and Navajo.

Natural language processing (NLP) is about finding patterns in language—for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it’s very English-centric. English is far and away the language that linguists have worked on the most and it’s also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). So one of the best ways to test an NLP system is to try languages other than English. The better that a system can deal with diverse data, the more confident that you can be in its ability to handle unseen data.

To this end, we might choose to define “weirdness” in terms of English. But that’s a pretty irritating definition. Let’s try to do something different.

A global method for linguistic outliers

The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total.

So rather than take an English-centric view of the world, WALS allows us to take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature. For example, English word order is subject-verb-object (SVO): there are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7% of languages start with a verb (like Welsh, Hawaiian and Majang), so cross-linguistically, starting with a verb is unusual. For what it’s worth, 41.0% of those languages are actually SOV order. (Aside: I’ve done some work with Hawaiian and Majang and that’s how I learned that verbs are a big commitment for me. I’m just not ready for verbs when I open my mouth.)

The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).
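The two-step filter described above can be sketched like this. The table layout (a dict from language to its feature values), the toy languages, and the small thresholds are all assumptions made for illustration; the post itself uses 100 languages per feature and 10 features per language on the real WALS data.

```python
from collections import Counter


def filter_sparse(data, min_langs_per_feature, min_features_per_lang):
    """Drop thinly coded features first, then thinly coded languages."""
    # How many languages are coded for each feature?
    coverage = Counter(f for feats in data.values() for f in feats)
    kept_features = {f for f, n in coverage.items() if n >= min_langs_per_feature}

    filtered = {}
    for lang, feats in data.items():
        kept = {f: v for f, v in feats.items() if f in kept_features}
        if len(kept) >= min_features_per_lang:
            filtered[lang] = kept
    return filtered


# Hypothetical mini-table, not real WALS codings.
toy = {
    "A": {"order": "SVO", "question": "particle", "tone": "none"},
    "B": {"order": "SOV", "question": "particle"},
    "C": {"order": "SOV"},
    "D": {"order": "VSO", "question": "none", "clicks": "yes"},
}
result = filter_sparse(toy, min_langs_per_feature=2, min_features_per_lang=2)
print(sorted(result))
```

Here “tone” and “clicks” are dropped because only one language is coded for each, and language C is dropped because it is left with a single feature.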

Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
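Here is one way the pruning step could look. The post does not say how association between features was measured, so this sketch simply takes precomputed association scores as input; the scores, the 0.5 threshold, and the coverage numbers for the object/verb and subject/verb features are invented (1,377 and 954 are the counts quoted in this post).

```python
def prune_correlated(coverage, association, threshold):
    """Greedily drop one feature from each strongly associated pair.

    coverage: feature -> number of languages coded for it.
    association: (feature_a, feature_b) -> strength between 0 and 1.
    """
    kept = set(coverage)
    # Visit the strongest associations first.
    for (a, b), strength in sorted(association.items(), key=lambda kv: -kv[1]):
        if strength < threshold:
            break
        if a in kept and b in kept:
            # Keep the feature with more languages coded; drop the other.
            kept.discard(a if coverage[a] < coverage[b] else b)
    return kept


coverage = {
    "word_order": 1377,      # from the post
    "object_verb": 1500,     # made up
    "subject_verb": 1490,    # made up
    "polar_questions": 954,  # from the post
}
association = {              # all scores made up
    ("word_order", "object_verb"): 0.9,
    ("word_order", "subject_verb"): 0.8,
    ("object_verb", "subject_verb"): 0.6,
    ("word_order", "polar_questions"): 0.1,
}
kept = prune_correlated(coverage, association, threshold=0.5)
print(sorted(kept))
```

With these toy inputs, the three word-order features collapse to the best-covered one, while the unrelated question feature survives.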

For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included subject-object-verb order then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). The Weirdness Index is then an average across the 21 unique structural features. But because different features have different numbers of values and we want to reduce skewing, we actually take the harmonic mean (and because we want bigger numbers to mean more weird, we subtract the mean from one). In this blog post, I’ll only report languages that have a value filled in for at least two-thirds of features (239 languages).
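A simplified sketch of the index looks like this. It leaves out the entropy normalization mentioned above (the details are not given here), and the five toy languages, their feature values, and the function names are all hypothetical.

```python
from statistics import harmonic_mean


def value_frequency(data, lang, feature):
    """Share of the other coded languages that have the same value as lang."""
    others = [
        feats[feature]
        for name, feats in data.items()
        if name != lang and feature in feats
    ]
    return others.count(data[lang][feature]) / len(others)


def weirdness(data, lang):
    """One minus the harmonic mean of the language's value frequencies."""
    freqs = [value_frequency(data, lang, f) for f in data[lang]]
    return 1 - harmonic_mean(freqs)


# Hypothetical codings, not real WALS data.
data = {
    "L1": {"order": "SVO", "question": "particle"},
    "L2": {"order": "SOV", "question": "particle"},
    "L3": {"order": "SOV", "question": "particle"},
    "L4": {"order": "SOV", "question": "inversion"},
    "L5": {"order": "SVO", "question": "inversion"},
}
for lang in data:
    print(lang, round(weirdness(data, lang), 3))
```

The harmonic mean is pulled toward the smallest frequencies, so a single very rare value raises a language’s score more than it would under a plain average.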

The outlier (weirdest) languages

The language that is most different from the majority of all other languages in the world is a verb-initial tonal language spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets (that’s where we get the word parka from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.

But here’s the rub—some of the weirdest languages in the world are ones you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually English is #33 in the Language Weirdness Index.


The 25 weirdest languages of the world. In North America: Chalcatongo Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, and Zoque; in South America: Paumarí and Trumai; in Australia/Oceania: Pitjantjatjara and Lavukaleve; in Africa: Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and Khoekhoe; in Asia: Nenets, Eastern Armenian, Abkhaz, Ladakhi, and Mandarin; and in Europe: German, Dutch, Norwegian, Czech, and Spanish.

By the way, how awesome of a name is “Pitjantjatjara“? (Also: can you guess which one of the internal syllables is silent?)

Questions and pronouns: two example features

This is odd. Is this odd? One of the features that distinguishes languages is how they ask yes/no questions. The majority of languages have a special question particle that they tack on somewhere (like the ka at the end of a Japanese question). Of 954 languages coded for this in WALS, 584 of them have question particles. The word order switching that we do in English only happens in 1.4% of the languages. That’s 13 languages total and most of them come from Europe: German, Czech, Dutch, Swedish, Norwegian, Frisian, English, Danish, and Spanish.

But there is an even more unusual way to deal with yes/no questions and that’s what Chalcatongo Mixtec does: nothing at all. It is the only language surveyed that does not have a particle, a change of word order, a change of intonation… There is absolutely no difference between an interrogative yes/no question and a simple statement. I have spent part of the day imagining a game show in this language.

Another thing languages have to deal with is what to do with simple subjects like I, they, or it. These are called pronominal subjects (something like The minister prevaricated has a nominal subject). The most common way to do this is to just tack the information about the subject on to the verb—437 out of 711 languages do this, like Spanish, Italian, and Portuguese. But Dutch, German, and Norwegian—like English—prefer having special subject pronouns that are normally/obligatorily present. But this is only done by 82 of the 711 languages coded in WALS. Kutenai (100 speakers in British Columbia, Canada) and Mumuye (400,000 speakers in Nigeria) do something even more unusual: they have something like subject pronouns but these go in different positions in the syntax than where full noun phrases go. And even more unusual than this is Chalcatongo Mixtec again: they combine several strategies so they have both subject markers that they add to verbs and they have pronoun words, too. But these pronoun words appear in a different spot from where a full noun phrase would show up.

The 5 least weird languages in the world

Now if I asked you to consider these languages, how weird would you say they were? Lithuanian, Indonesian, Turkish, Basque, and Cantonese. Surprise! They are really low on the Weirdness Index. They don’t seem typical to linguists and language learners but for these 21 features they stick with the crowd. Notice that we get isolates (like Basque) distributed throughout levels of Weirdness. Basque is “typical” but Kutenai, another isolate, is one of the weirdest of all languages. Even more surprising is that Mandarin Chinese is in the top 25 weirdest and Cantonese is in the bottom 10. This has to do with the fact that they have different sounds: Mandarin, unlike Cantonese, has uvular continuants and has some limits on “velar nasals” (like English, Mandarin can have a sound like the one at the end of “song” but it can’t have that sound at the beginning of words; worldwide it’s rare to have that particular restriction).

At the very bottom of the Weirdness Index there are two languages you’ve heard of and three you may not have. Hungarian, normally renowned as a linguistic oddball, comes out as totally typical on these dimensions. (I got to live in Budapest last summer and I swear that Hungarian does have weirdnesses, it just hides them other places.) Chamorro (a language of Guam spoken by 95,000 people), Ainu (just a handful of speakers left in Japan; it is nearly extinct), and Purépecha (55,000 speakers, mostly in Mexico) are all very normal. But the very most super-typical, non-deviant language of them all, with a Weirdness Index of only 0.087, is Hindi, which has only a single weird feature.

Part of this is to say that some of the languages you take for granted as being normal (like English, Spanish, or German) consistently do things differently than most of the other languages in the world. It reminds me of one of the basic questions in psychology: to what extent can we generalize from research studies based on university students who are, as Joseph Henrich and his colleagues argue, Western, Educated, Industrialized, Rich, and Democratic? In other words: sometimes the input is WEIRD and you need to ask yourself how that changes things.

You’re weird

Even though the methods here don’t define things in terms of English, they still smuggle in some cultural-specificity. That is, the linguists who developed and annotated the features were mostly speakers of European languages. What features might a person from Papua New Guinea or Ethiopia or the Amazon have come up with instead? And of course, WALS doesn’t have any data at all on about 4,000 languages. And the languages that it has the most data for are not truly random.

Despite this, English still ranks as highly unusual (it comes in as #33 with an index value of 0.756). That English-speaking brain you’ve been using to read this? It’s wired weird.

– Tyler Schnoebelen (@TSchnoebelen)

Appendix: The tops and bottoms

Here are the values for the top and bottom 10 languages. You might also check out our posts on:

Economic powerhouse languages (which languages are associated with places that have high/growing GDPs?)
NLP for all languages (expanding NLP beyond English to other languages, including low-resource and/or endangered languages)
The Multilingual Falcon (a tour of French, Hungarian, and Turkish by way of translations of The Maltese Falcon)

Rank	Language	Weirdness Index
1	Mixtec (Chalcatongo)	0.972
2	Nenets	0.935
3	Choctaw	0.924
4	Diegueño (Mesa Grande)	0.920
5	Oromo (Harar)	0.919
6	Kutenai	0.908
7	Iraqw	0.900
8	Kongo	0.883
9	Armenian (Eastern)	0.861
10	German	0.858
…
230	Basque	0.189
231	Bororo	0.153
232	Quechua (Imbabura)	0.151
233	Usan	0.151
234	Cantonese	0.143
235	Hungarian	0.132
236	Chamorro	0.128
237	Ainu	0.128
238	Purépecha	0.100
239	Hindi	0.087

Update: Here is the full list, with the 21 weirdness features and all of the languages that had values for at least one of them (don’t trust those values, of course).

Weirdness_index_values_full_list


Justice Kennedy’s favorite phrases

12 Jun

In the chambers of the United States Supreme Court, nine men and women are deciding what’s going to happen with same-sex marriage in America. Will a widow get back taxes from her wife’s estate? Will same-sex marriage be reinstated in California? Or if they rule more broadly, will same-sex marriage be made legal across all 50 states, not just 12?

The decisions are likely to come down to one single person: Supreme Court Justice Anthony Kennedy. Expert court-watchers agree that it’s clear how the other eight justices will vote (four inclined to support same-sex marriage, four disinclined).

If we could predict the outcome of court cases, we would have retired to our own islands long ago. But what we *can* do is look at Kennedy’s communications in this court case and see if his patterns of communication differ significantly from how he has communicated in past court proceedings.

First let’s look at some of the phrases that Justice Kennedy uses a lot more than all the other justices (relative to how much he’s speaking overall). We’ll be using this fantastic data set from Cristian Danescu-Niculescu-Mizil et al. Again, this is relative to all the justices but I’ll put in notes for how Scalia and Ginsburg use the phrase for comparison. In the infographic, the way you get “expected” values is to take the total number of times anyone on the Court says a word/phrase and then multiply it by how much a particular justice is speaking overall. If there were 100 uses of “foo” across all the justices and Justice X spoke 10% of all the words, we’d expect them to have 10 “foo”s. We want to pay attention to when observed/expected ratios are particularly high or low: those are phrases worth further inquiry.
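The arithmetic behind the expected values can be sketched as follows. The 100 uses and the 10% share of words come from the example above; the observed count of 25, the word totals, and the function names are invented for illustration.

```python
def expected_count(total_uses, speaker_words, all_words):
    """Uses we'd expect from a speaker given their share of all words spoken."""
    return total_uses * speaker_words / all_words


def observed_expected_ratio(observed, total_uses, speaker_words, all_words):
    """Above 1: the speaker favors the phrase. Below 1: they avoid it."""
    return observed / expected_count(total_uses, speaker_words, all_words)


# Justice X speaks 1,000 of 10,000 words (10%); "foo" is said 100 times in all.
expected = expected_count(total_uses=100, speaker_words=1_000, all_words=10_000)

# Suppose Justice X actually said "foo" 25 times (hypothetical).
ratio = observed_expected_ratio(25, 100, 1_000, 10_000)
print(expected, ratio)
```

A ratio of 2.5 would mean the justice uses the phrase two and a half times as often as their overall talkativeness predicts.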

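[Infographic: phrases Justice Kennedy uses more than expected relative to all the justices, with notes on Scalia and Ginsburg]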

Kennedy also seems to like “in this case”, “I take it”, “can you tell”, “you want us”, “let me ask”, “and so forth”, and “I’m not sure” relative to all the other justices. Compared to all the other justices, he seems to avoid “I don’t”, “you don’t”, “don’t know”, and “you’re saying”.

Most of these top phrases are the kinds of things you might be inclined to toss away if you were trying to do “topic detection”. But in opinion detection and sentiment analysis, they are much more likely to carry an important signal. Take “well”. “Well” is one of the most frequent “discourse markers” to pop up in English speech. Certainly it pops up a lot in Kennedy’s speech. What’s it doing?

“Well” often indicates a topic change but it can also mark an elaboration or explanation; in that way it’s kind of like a “be that as it may” or “that said”. “Well” can mark a kind of insufficiency in what’s been said/what’s about to be said. It can serve as a pause filler (like “um” or “uh”). It often marks the introduction of reported speech. My own favorite (though wordy) definition is from Andreas Jucker (1993):

[Well is] a signpost that directs the addressees to renegotiate the relevant background assumptions, either because a new set of assumptions becomes relevant or because some of the manifest assumptions are mistaken.

And if we look at how Kennedy is using “well” in the same-sex marriage cases, that seems about right (note that these cases were not included in the data in the chart above). I should probably give you the preceding context since these utterances are so clearly responsive to what’s come before. But in the interest of space, I’m just going to give the utterances:

Well, that — that assumes the premise. We didn’t — the House didn’t know it was unconstitutional. I mean —
Well, why not? They’re concerned about the argument and you say that the House of Representatives standing alone can come into the court. Why can’t the Senate standing alone come into court and intervene on the other side?
Well, it applies to over what, 1,100 Federal laws, I think we are saying. {This is a lengthy comment/question by Kennedy that is worth reading: he’s grappling with the fact that marriage is clearly a power for the states but the Federal government has all sorts of stuff going on in citizens’ lives regarding marriage.}
Well, but it’s not really uniformity because it regulates only one aspect of marriage. It doesn’t regulate all of marriage.
Well, then are — are you conceding the point that there is no harm or denigration to traditional opposite-sex marriage couples. So you’re conceding that.
Well, but, then it — then it seems to me that you should have to address Justice Kagan’s question.
Well, the Chief — the Chief Justice and Justice Kagan have given a proper hypothetical to test your theory. {This quote also goes on as Kennedy lays out a test again to think through the issue of “standing”, that is, who has the right to bring a case forward.}

This does seem to signal Kennedy challenging what’s been said and it matches Jucker’s definition reasonably well.

But of course, we’re most curious about how Kennedy speaks in the oral arguments based on how he’s ultimately going to vote. When Kennedy is going to end up voting with Ginsburg and against Scalia, he tends to use the phrasing “whether or not” (he uses this phrase over 8 times more often than we’d expect when he’s going to vote with Ginsburg). He also tends to use the words “can”, “can’t”, “or”, “your”, “I’m”, “is that”, and “argument” when he’s ultimately going to end up voting with Ginsburg.

By contrast, when Kennedy is going to vote with Scalia and against Ginsburg, he tends to use “there is”, “that’s”, “same”, and “government”. He also uses a lot more of the past tense when voting with Scalia (particularly “has”). Kennedy also uses a lot of “this” when he’s going to vote with Scalia against Ginsburg, in particular “this case”. (For more about how interesting demonstratives are, see the overview/links in this post.)

But notice that these signals are rather weak. That’s because across 192 cases that came before the Court before the same-sex marriage cases, Kennedy, Scalia, and Ginsburg voted together in 108 of them. (Kennedy voted with Scalia and against Ginsburg in 43, with Ginsburg against Scalia in 28, and with neither one of them in 13.)

So how is Kennedy going to vote? Well…

Appendix: Other text analyses

Here’s a collection of links with legal scholars, journalists and others interpreting Kennedy:

Erwin Chemerinsky: ABAJournal and SCOTUSblog
Dana Milbank: Washington Post
Sahil Kapur: Talking Points Memo here and here
Nina Totenberg: NPR here and here
Dylan Scott: Governing
John Bursch: SCOTUSblog here and here
Lyle Denniston: SCOTUSblog here and here
Ilya Somin: The Volokh Conspiracy
Amy Howe: SCOTUSblog
Marty Lederman: SCOTUSblog
Adam Liptak: NYTimes
Jeffrey Rosen: The New Republic
Peter Dreier: Huffington Post

Notice that one of the things a few of the people comment on is “tone of voice”—Nina Totenberg mentions Kennedy sounding “ticked off”. That’s a reminder that using transcripts alone wipes out a lot of powerful phonetic cues.

– Tyler Schnoebelen (@TSchnoebelen)


The street where you live

7 Jun


The State of the Map conference is happening this weekend here in San Francisco. In honor of that, here are some facts about the world’s places, courtesy of Open Street Maps’ enormous pile of data. We’ve been processing ~28,000,000 entries as part of named entity recognition (10,000,000 unique place names, but the conventions of the data mean that what is essentially one highway could be in there multiple times as it crosses various geographic boundaries).

There are more 2nd Streets than 1st Streets (this is because in many places the “1st” street is known as the “Main” street). There are 2.8 times as many 2nd Streets as 2nd Avenues.

Most numbers have a street or avenue associated with them: 1st, 2nd, 3rd…99th…237th…all the way up to 488th Street, then there start to be gaps (is there something wrong with 489th, 494th, 496th, and 498th streets?). The biggest hole is probably between 3200th Street and 4400th. Nowheresville is made up of those streets.
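Finding those gaps is a small exercise once you have a list of place names. This sketch assumes the names are plain English strings such as “2nd Street”; the sample list and the helper names are invented, and real map data would need far more normalization.

```python
import re

# Matches names like "1st Street" or "488th Street" and captures the number.
ORDINAL_STREET = re.compile(r"^(\d+)(?:st|nd|rd|th) Street$")


def numbered_streets(names):
    """Collect the numbers of all 'Nth Street' names."""
    numbers = set()
    for name in names:
        match = ORDINAL_STREET.match(name)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def missing_numbers(numbers):
    """Numbers between 1 and the largest one seen that have no street."""
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


# Invented sample, not the real data.
names = [
    "1st Street", "2nd Street", "3rd Street", "5th Street",
    "Main Street", "6th Avenue", "7th Street",
]
gaps = missing_numbers(numbered_streets(names))
print(gaps)
```

In the toy list, 4 and 6 come back as gaps: “6th Avenue” does not count because the pattern only looks at streets.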

Btw, Open Street Maps tells us there are Nowhere Roads in 10 different zip codes. But, uh, most of these are in the South. Actually, 5 of them are in Georgia.

There are 11 times as many places with “Circle” in them as with “Square”. My favorite is probably ხალხთა მეგობრობის მოედანი (“Khalkhta Megbroba Square”, well, “Friendship Square” in Georgia. But the other Georgia. Isn’t Georgian script cool?)

There are lots of tree streets: Oak, Pine, Elm, Maple, Walnut, Cedar, Chestnut, Cherry, Ash, Birch, Poplar. There are almost twice as many Willow Creeks as Willow Streets.

There are quite a few Pleasant Streets. If that bores you, may I suggest Evil Avenue in St. Clair, Illinois? Or point you towards Mount Evil in Tennessee?

– Tyler Schnoebelen (@TSchnoebelen)



Horse name corpus for Belmont Stakes

6 Jun

The Belmont Stakes (the third leg of the Triple Crown) is happening this weekend. Check out this blog post on horse names:

http://idibon.com/back-the-right-horse-name/

Does anyone have an even more enormous corpus of horse names? I grabbed and analyzed the top 4 finishers for the last 137 years of the Kentucky Derby and showed the diversity of naming schemes.


Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

Chat room corpus

Went hunting around for some chat room corpora today. I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]

African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]

COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]

What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
