Archive | February, 2012

Like, let’s go to the movies, I mean…

23 Feb

Ooh, check out the Cornell Movie Dialogs Corpus that Lillian Lee and Cristian Danescu-Niculescu-Mizil have made available! (Here’s their work on accommodation/priming/engagement, including their rationale for using a movie corpus.)

The corpus features conversations between over 9,000 characters in 617 movies. D-N-M&L have marked it up with a lot of interesting information: the gender of who’s talking, what position they are in the film credits, what genres and ratings the movie gets in IMDB, etc. In this post I’m going to look at like and I mean.

The distribution of like

We really want to focus on “discourse like” (as in, That’s, like, awesome). To get rid of examples like the one in this sentence here or in I like this corpus, I restrict myself to examples of like that have a comma after the like (using a comma-before gets too many “I should’ve married some in the family, like you” matches). There are 346 lines that match.
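Here's a minimal sketch of that comma-after filter in Python. It is an illustration, not the script behind the counts in this post: the pattern, the function name `has_like_comma`, and the sample lines are all assumptions made for the example.

```python
import re

# Assumption: "like" counts as a discourse-like candidate when a comma follows it directly.
# The \b keeps words such as "dislike," or "unlike," from matching.
LIKE_COMMA = re.compile(r"\blike,", re.IGNORECASE)

def has_like_comma(line: str) -> bool:
    """Return True if the line contains the word 'like' immediately followed by a comma."""
    return LIKE_COMMA.search(line) is not None

# Illustrative lines only; this is not a sample drawn from the corpus files.
sample_lines = [
    "That's, like, awesome.",
    "I like this corpus.",
    "I should've married some in the family, like you.",
    "Like, what are you even doing?",
]

matches = [line for line in sample_lines if has_like_comma(line)]
print(len(matches))
```

A filter like this still lets through some non-discourse uses (a verb or preposition that just happens to sit before a comma), so the matches are best treated as candidates.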

The first thing you’re going to guess is that this is going to occur a lot in comedies–and you’re right. It occurs in comedies 1.64 times more often than we’d expect if it were just distributed across genres by chance. Maybe it’ll surprise you more to know that it also occurs a lot in “Crime” genre films. Discourse like does NOT like to occur in action/adventures or mysteries, though.
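One way a figure like that 1.64 could be computed is as an observed-over-expected ratio: the number of matching lines in a genre divided by the number you'd expect if matches were spread across genres in proportion to each genre's share of all lines. The sketch below assumes each line carries a single genre label (real movies often have several) and uses made-up counts; `observed_over_expected` is a hypothetical helper, not something shipped with the corpus.

```python
from collections import Counter

def observed_over_expected(all_lines, matching_lines):
    """Map each genre to observed matches divided by expected matches.

    Expected matches for a genre = (that genre's lines) * (overall match rate).
    Both arguments are iterables of genre labels, one label per line of dialog.
    """
    total = Counter(all_lines)
    hits = Counter(matching_lines)
    total_lines = sum(total.values())
    total_hits = sum(hits.values())
    if total_hits == 0:
        raise ValueError("no matching lines, so the ratio is undefined")
    # Algebraically the same as hits / (lines * overall_rate), arranged to keep the arithmetic exact.
    return {g: (hits[g] * total_lines) / (total[g] * total_hits) for g in total}

# Toy data (hypothetical): 1,000 lines of dialog, 10 of which match.
all_lines = ["comedy"] * 400 + ["action"] * 600
matching_lines = ["comedy"] * 8 + ["action"] * 2

ratios = observed_over_expected(all_lines, matching_lines)
print(ratios)
```

A ratio above 1 means the genre has more matches than chance would predict and a ratio below 1 means fewer. The same calculation works for speaker gender or credit position by swapping in those labels.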

You’re probably also going to guess that female characters use it more than male characters–and you’re right. In fact, it’s when a female character is talking to another female character that they use it the most. But the counts are kind of low here since the corpus is not completely gender-annotated.

I didn’t really have a guess about whether protagonists would be using it more than minor characters. Just taking “is a character higher up in the credits talking to a character lower down in the credits”, we see that the more “important” a character is, the MORE they use discourse like. The effect is especially strong if the person talking is first or second in the credits and they’re talking to someone who appears fifth or lower.

Note that using position as a measure of character importance is a little tricky. For example, Melissa McCarthy is nominated for an Oscar this year for Best Supporting Actress in Bridesmaids–but she’s listed 16th in the credits. And Gary Oldman is up for Best Actor for Tinker, Tailor, Soldier, Spy and he’s actually 7th in the credits there. But these are outliers. Mostly, the characters with the most screen time and the biggest bang are higher up in the credits (this is confounded with the fact that actors/actresses have something to do with the credit-rolls, too).

The use of I mean

There are 2,353 I mean’s in the corpus.

The movies put the most I mean’s in the mouths of female characters–they use it at about the same rate whether they’re talking to men or women. Male characters speaking to female characters also use a fair amount of I mean–which really means that the “odd ball” group is the males-speaking-to-males. They’re the only group that’s constrained against using I mean (compared to what would’ve happened at chance).

In terms of genre, it’s in romances, comedies, and dramas that you get the most I mean’s and in thrillers where you get the fewest.

Comparing the interlocutors’ positions in the credits, there’s not as much of a hierarchy thing happening with I mean. One thing that is strange is that while characters who are first in the credits use about as much I mean as we’d predict (based on their overall line counts and the overall percentage of lines-anyone-has with I mean), the characters in the second position are using A LOT of I mean.

When we look for interactions between credit-position and genre, we generally see that these characters do the same thing. That is, neither the 1st nor the 2nd person in the credits of a thriller is using much I mean. But both 1st and 2nd positions are using a lot of I mean in dramas.

They part ways in comedies and sci-fi. In comedies, the 1st position uses a lot of I mean, while the 2nd credited character uses very little. My sense is that I mean is a great resource in comedies for someone who has to explain themselves a lot and that’s the main protagonist in a comedy–they’re the ones who are put in spots that require clarification:

JUNO: My dad went through this phase where he was obsessed with Greek and Roman mythology. He named me after Zeus’s wife. I mean, Zeus had other lays, but I’m pretty sure Juno was his only wife. She was supposed to be really beautiful but really mean. Like Diana Ross.

In sci-fi, it’s reversed. The 1st-credited characters are very restricted from using I mean, while the 2nd-credited characters use it A LOT. This has something to do with explaining yourself again, but genre conventions are different. The hero of a sci-fi movie doesn’t do a lot of I mean’s.

You don’t get Ripley from Aliens talking about I mean (she has one example, but it’s That’s not what I mean). By contrast, “Newt”, the young girl who is the colony survivor (and who is #2 in the credits), says:

NEWT: Isn’t that how babies come? I mean people babies…they grow inside you?

Hm. I guess I’m going to stop now…with a really creepy line if you know anything about its context.

[Update 2/29/2012: After I published this article, I started to wonder whether the quote I gave for Newt really counts as a good example of discourse “I mean”. I think the findings are still true, but this super-cool example may not be as super-cool as I initially thought. What do you think?]

You’ve got a text, now get easy frequency and collocation information

21 Feb

You can find my intro to the Corpus of Contemporary American English here, but there’s a related site that will let you enter a bunch of text and then tell you all about it.

Here’s what it does:

* It highlights medium and low frequency words (and creates lists of these words you can use offline)

* You can see how “academic” the text is

* You can click on a word and get its frequency, frequency-per-genre (spoken/fiction/magazine/newspaper/academic), its top collocates (nearby words), synonyms, and related words.

* At the phrase level, you can highlight a phrase and it’ll show you related phrases from COCA. The example Mark Davies gives is clicking on “potent argument” would show you “strong/persuasive/convincing argument”, which are all more common.

Intro to corpus linguistics

16 Feb

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings.


Glosses for grammatical and ungrammatical sentences

16 Feb

ODIN, the Online Database of INterlinear Text (I know, I know, why not “ODIT”?) collects sentences from linguistics journals that have interlinear glosses.

They’ve got data on over a thousand languages–and probably most useful, they give you something grammars usually don’t–what ISN’T allowed (or at least what is claimed to be disallowed).

For example, from Broadwell (2002) on Amharic:

    28) *wädä [[yä=Yohannïs]PP bet]]NP
    toward of=John       house
    (`toward John's house')


    29) wädä [[Yohannïs]PP bet]]NP
    toward John        house
    `toward John's house'
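If you want to work with examples like these programmatically, a rough sketch is to pair each source word with its gloss and record whether the example is starred. Everything here is an assumption for illustration: `parse_igt` is a hypothetical helper, not part of ODIN, and it expects the example number to be stripped already and category labels (PP, NP) to be written in capitals right after a closing bracket.

```python
import re

def parse_igt(source_line: str, gloss_line: str):
    """Pair each source word with its gloss.

    Returns (ungrammatical, pairs), where ungrammatical is True when the
    source line starts with an asterisk.
    """
    stripped = source_line.strip()
    ungrammatical = stripped.startswith("*")
    # Remove the star, opening brackets, and closing brackets plus any
    # capital-letter category label attached to them (e.g. "]PP", "]NP").
    cleaned = re.sub(r"\][A-Z]*|\[", "", stripped.lstrip("*"))
    words = cleaned.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("source and gloss lines have different token counts")
    return ungrammatical, list(zip(words, glosses))

# The two Amharic examples quoted above, minus their example numbers.
starred = parse_igt("*wädä [[yä=Yohannïs]PP bet]]NP", "toward of=John house")
plain = parse_igt("wädä [[Yohannïs]PP bet]]NP", "toward John house")
print(starred)
print(plain)
```

Whitespace alignment like this only holds when the source and gloss lines have the same number of space-separated tokens, which is why the function raises an error instead of guessing when they differ.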

Chat room corpus

14 Feb

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource.

But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people are anonymized.

Read more about it here at the Westbury Lab’s site or go get the data, which is stored on Amazon (150 MB chunks adding up to 40 GB).

[Update 2/29/2012: Here’s some more stuff, though it looks like the counts are kind of small; also check out the comments for another corpus.]

Native American language resources

8 Feb

Yesterday I got a chance to hear Kayla Carpenter, Maryrose Barrios, and Justin Spence talk about preserving California Indian languages. (Kayla and Justin are grad students in the Berkeley linguistics department; Maryrose is an undergrad doing physics, including preservation work on really old audio records of native songs, stories, etc.)

If you’re a linguist, there’s all sorts of stuff to look at. If you’re a Native American, resources are getting easier and easier to get at. (There’s a lot of sensitivity to the idea that earlier work between researchers and community members ended up sending stuff into a black box, so current folks are trying to make both new and old materials more accessible for non-linguists.)

Thanks to Justin for sending me not just a list of resources but notes on them, too:

At a national level, you might want to check out the National Anthropological Archives at the Smithsonian and the American Philosophical Society (the latter is where Sapir’s notes are; he studied lots of languages around the turn of the last century and took really good notes).

California has historically had the greatest density of native languages and folks at Berkeley have been archiving stuff for a long time. There are four main archives:

  • P.A. Hearst Museum of Anthropology (it’s got pre-1950 audio stuff).
  • Bancroft Library (paper stuff, pre-1950)
  • Post-1950, you can consult the Berkeley Language Center (audio) and the Survey of California and Other Indian Languages (paper stuff). They’ve recently combined their catalogs to make searching easier: There are a lot of digital resources here (scanned images and digital audio).
    • (It sounds like you can find records of what’s at the Hearst using CLA, too.)

Regional archives also have surprising stuff. Justin gives two examples:

  • Pliny Earle Goddard’s materials on Californian Athabaskan languages are mostly at Bancroft and the APS, but his Lassik notebooks are at the University of Washington (Melville Jacobs Papers collection, they are apparently marked up with Harry Hoijer’s annotations).
  • J.P. Harrington’s archives are mostly at the National Anthropological Archives, but the Barbareno Chumash materials are in the Santa Barbara Museum of Natural History.

Finally, Justin says the best summary of archival materials for languages of California is in Victor Golla’s recent book:


African language corpora

8 Feb

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest).

And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) squat about any African languages. My hope is this post will be useful for Africanists who are curious about big collections of spoken/written texts. But even more so, I’m hoping that researchers who stick to English and other languages with easy, familiar corpora will take an interest and start playing with the data. Fwiw, people who study African languages are really friendly–so if you’re, say, a computational linguist, it’ll be no problem to find you a buddy.

A number of the resources below should help you find out more about areas/languages outside of Africa, too.

Why study African languages?

Larry Hyman gives some answers here (my favorite part starts on page 14–Leggbo has subject-verb-object word order for affirmative sentences, but subject-object-verb word order in the negative). The Economist gives its answers here (of note: 600 million mobile phone users in Africa–more than Europe or America).

My own answers come from an initial shock of studying Zulu after getting lessons as an anniversary gift. The clicks are wonderful, it’s true. But have you seen the morphology? Zulu is a temptress. Okungumthakathi kuyangikhwifa (‘The damn witch is bewitching me.’)

My more recent work has taken me to Ethiopia–also a beautiful country and packed with very different, very understudied languages. In one village I get to work on three completely unrelated languages (Shabo, Majang, Shekkacho). It’ll be a while before there are any corpora of size on these languages (Shabo has 300 speakers), but Amharic is a major language with a sizable population. It’s Semitic, so if you squint your eyes, you can see a relationship to Arabic and Hebrew, but Amharic is more distant than those two. Cushitic influenced it a lot, so you get subject-object-verb order, for example. You also get ejectives (make a plosive sound like t/k/p/d, there–you breathed through them. Ejectives are like those but just use the air stored above your glottis).

Finding African language corpora

First a request: if you know of corpora based on naturalistic spoken conversations, please let me know.

Kevin Scannell makes several great resources available for people studying less common languages:

It’s worth trucking over to African Language Technology and searching their archives/posting questions. Right off the bat, it has a number of resources, including:

And it’s also worth checking out all the different resources at OLAC.
For Zulu, check out the Ukwabelana Corpus, prepped by folks doing computational linguistics.

For a treasure-trove of Ndebele/Zulu lexical info–and a huge number of other Bantu languages–take a look at CBOLD (if you have issues, there’s an old version at Berkeley).

ALLEX offers resources for Ndebele, Shona, and Nambya.

Moving away from Bantu, for Amharic, there’s a whole site for resources to help you: (Speaking of Semitic languages, if you want Arabic, check out my earlier post here).

If you’re interested in Hausa, a good place to start is Uwe Seibert’s Hausa Online.

Also don’t forget that resources like the BBC have written/spoken content in Hausa, Somali, Swahili, Kinyarwanda/Kirundi.

Personally, I’m fascinated by this resource on information structure for 25 sub-Saharan languages.

DoBeS has a number of languages archived, though these tend to be small, highly endangered languages, so if you only want giant amounts of text, this is probably the wrong place. On the other hand, if you want to have a huge impact…EMELD is another place people archive information.

From the LDC you might be interested in:

To make your own corpus from web texts, consider CorpusCollie (the example they use is Luo).