Archive | February, 2012

Like, let’s go to the movies, I mean…

23 Feb

Ooh, check out the Cornell Movie Dialogs Corpus that Lillian Lee and Cristian Danescu-Niculescu-Mizil have made available! (Here’s their work on accommodation/priming/engagement, including their rationale for using a movie corpus.)

The corpus features conversations between over 9,000 characters in 617 movies. D-N-M&L have marked it up with a lot of interesting information: the gender of who’s talking, what position they are in the film credits, what genres and ratings the movie gets in IMDB, etc. In this post I’m going to look at like and I mean.

The distribution of like

We really want to focus on “discourse like” (as in, That’s, like, awesome). To get rid of examples like the one in this sentence here or in I like this corpus, I restrict myself to examples of like that have a comma after the like (using a comma-before gets too many “I should’ve married some in the family, like you” matches). There are 346 lines that match.
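For the curious, here's the kind of filter this amounts to–just a sketch in Python, assuming you've already pulled the dialog lines out as plain strings (the actual corpus files have more structure than this):

```python
import re

# Match "like" followed by a comma: a rough proxy for discourse "like"
# (as in "That's, like, awesome"). A comma-BEFORE pattern would also
# catch things like "in the family, like you", so we avoid it.
DISCOURSE_LIKE = re.compile(r"\blike\s*,", re.IGNORECASE)

def has_discourse_like(line):
    return bool(DISCOURSE_LIKE.search(line))

lines = [
    "That's, like, awesome.",
    "I like this corpus.",
    "I should've married someone in the family, like you.",
]
matches = [l for l in lines if has_discourse_like(l)]  # only the first line
```

It's crude (it'll grab a few cases like "things I like, and things I don't"), but for eyeballing the distribution it does the job.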

The first thing you’re going to guess is that it occurs a lot in comedies–and you’re right. It occurs in comedies 1.64 times more often than we’d expect if it were just distributed across genres by chance. Maybe it’ll surprise you more to know that it also occurs a lot in “Crime” genre films. Discourse like does NOT like to occur in action/adventures or mysteries, though.
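In case the “more often than chance” math isn’t obvious: it’s just observed count over expected count, where the expected count assumes the feature is spread across genres in proportion to each genre’s share of lines. A quick sketch–the per-genre numbers here are invented for illustration (only the 346 total comes from the query above):

```python
# Observed-over-expected ratio for a feature across genres.
# Expected count in a genre = total feature count * (genre's share of all lines).
# The genre line counts below are made up for illustration.

def obs_over_exp(observed_in_genre, total_observed, genre_lines, total_lines):
    expected = total_observed * (genre_lines / total_lines)
    return observed_in_genre / expected

# Pretend comedies hold 25% of all lines but 142 of the 346 discourse-"like"s:
ratio = obs_over_exp(142, 346, 75_000, 300_000)  # -> about 1.64
```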

You’re probably also going to guess that female characters use it more than male characters–and you’re right. In fact, it’s when a female character is talking to another female character that they use it the most. But the counts are kind of low here since the corpus is not completely gender-annotated.

I didn’t really have a guess about whether protagonists would be using it more than minor characters. Just taking “is a character higher up in the credits talking to a character lower down in the credits”, we see that the more “important” a character is, the MORE they use discourse like. The effect is especially strong if the person talking is first or second in the credits and they’re talking to someone who appears fifth or lower.

Note that using credit position as a measure of character importance is a little tricky. For example, Melissa McCarthy is nominated for an Oscar this year for Best Supporting Actress in Bridesmaids–but she’s listed 16th in the credits. And Gary Oldman is up for Best Actor for Tinker Tailor Soldier Spy, and he’s actually 7th in the credits there. But these are outliers. Mostly, the characters with the most screen time and the biggest impact are higher up in the credits (this is confounded with the fact that the actors’ own star power has something to do with credit order, too).

The use of I mean

There are 2,353 I mean’s in the corpus.

The movies put the most I mean’s in the mouths of female characters–they use about the same rate whether they’re talking to men or women. Male characters speaking to female characters also use a fair amount of I mean–which means the real “odd ball” group is males-speaking-to-males. They’re the only group that uses I mean less than chance would predict.

In terms of genre, it’s in romances, comedies, and dramas that you get the most I mean’s and in thrillers where you get the least.

Comparing the interlocutors’ positions in the credits, there’s not as much of a hierarchy thing happening with I mean. One thing that is strange: while characters who are first in the credits use about as much I mean as we’d predict (based on their overall line counts and the overall percentage of lines-anyone-has with I mean), the characters in the second position are using A LOT of I mean.

When we look for interactions between credit position and genre, we generally see these characters doing the same thing. That is, neither the 1st nor the 2nd person in the credits of a thriller is using much I mean. But both 1st and 2nd positions are using a lot of I mean in dramas.

They part ways in comedies and sci-fi. In comedies, the 1st position uses a lot of I mean, while the 2nd credited character uses very little. My sense is that I mean is a great resource in comedies for someone who has to explain themselves a lot, and in a comedy that’s the main protagonist–they’re the ones who get put in spots that require clarification:

JUNO: My dad went through this phase where he was obsessed with Greek and Roman mythology. He named me after Zeus’s wife. I mean, Zeus had other lays, but I’m pretty sure Juno was his only wife. She was supposed to be really beautiful but really mean. Like Diana Ross.

In sci-fi, it’s reversed. The 1st-credited characters hardly use I mean at all, while the 2nd-credited characters use it A LOT. This again has something to do with explaining yourself, but the genre conventions are different. The hero of a sci-fi movie doesn’t do a lot of I mean’s.

You don’t get Ripley from Aliens talking about I mean (she has one example, but it’s That’s not what I mean). By contrast, “Newt”, the young girl who is the colony survivor (and who is #2 in the credits), says:

NEWT: Isn’t that how babies come? I mean people babies…they grow inside you?

Hm. I guess I’m going to stop now…with a really creepy line if you know anything about its context.

[Update 2/29/2012: After I published this article, I started to wonder whether the quote I gave for Newt really counts as a good example of discourse “I mean”. I think the findings are still true, but this super-cool example may not be as super-cool as I initially thought. What do you think?]

You’ve got a text, now get easy frequency and collocation information

21 Feb

You can find my intro to the Corpus of Contemporary American English here, but there’s a related site called http://www.wordandphrase.info that will let you enter a bunch of text and then tell you all about it.

Here’s what it does:

* It highlights medium- and low-frequency words (and creates lists of these words you can use offline)

* You can see how “academic” the text is

* You can click on a word and get its frequency, frequency-per-genre (spoken/fiction/magazine/newspaper/academic), its top collocates (nearby words), synonyms, and related words.

* At the phrase level, you can highlight a phrase and it’ll show you related phrases from COCA. The example Mark Davies gives: clicking on “potent argument” shows you “strong/persuasive/convincing argument”, all of which are more common.

Intro to corpus linguistics

16 Feb

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings.

http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

 

Glosses for grammatical and ungrammatical sentences

16 Feb

ODIN, the Online Database of INterlinear Text (I know, I know, why not “ODIT”?) collects sentences from linguistics journals that have interlinear glosses.

http://odin.linguistlist.org/

They’ve got data on over a thousand languages–and probably most useful, they give you something grammars usually don’t–what ISN’T allowed (or at least what is claimed to be disallowed).

For example, from Broadwell (2002) on Amharic:

    28) *wädä [[yä=Yohannïs]PP bet]NP
    toward of=John       house
    (`toward John's house')

Versus:

    29) wädä [[Yohannïs]PP bet]NP
    toward John        house
    `toward John's house'

Chat room corpus

14 Feb

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource.

But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people are anonymized.

Read more about it here at the Westbury Lab’s site or go get the data, which is stored on Amazon (150 MB chunks adding up to 40 GB).

[Update 2/29/2012: Here’s some more stuff though it looks like the counts are kind of small: http://caw2.barcelonamedia.org/node/7; also check out the comments for another corpus to check out.]

Native American language resources

8 Feb

Yesterday I got a chance to hear Kayla Carpenter, Maryrose Barrios, and Justin Spence talk about preserving California Indian languages. (Kayla and Justin are grad students in the Berkeley linguistics department; Maryrose is an undergrad doing physics, including preservation work on really old audio records of native songs, stories, etc.)

If you’re a linguist, there’s all sorts of stuff to look at. If you’re a Native American, resources are getting easier and easier to get at. (There’s a lot of sensitivity to the idea that earlier work between researchers and community members ended up sending stuff into a black box, so current folks are trying to make both new and old materials more accessible for non-linguists.)

Thanks to Justin for sending me not just a list of resources but notes on them, too:

At a national level, you might want to check out the National Anthropological Archives at the Smithsonian and the American Philosophical Society (the latter is where Sapir’s notes are; Sapir studied lots of languages around the turn of the last century and took really good notes).

California has historically had the greatest density of native languages and folks at Berkeley have been archiving stuff for a long time. There are four main archives:

  • P.A. Hearst Museum of Anthropology (it’s got pre-1950 audio stuff).
  • Bancroft Library (paper stuff, pre-1950)
  • Post-1950, you can consult the Berkeley Language Center (audio) and the Survey of California and Other Indian Languages (paper stuff). They’ve recently combined their catalogs to make searching easier: http://cla.berkeley.edu. There are a lot of digital resources here (scanned images and digital audio).
    • (It sounds like you can find records of what’s at the Hearst using CLA, too.)

Regional archives also have surprising stuff. Justin gives two examples:

  • Pliny Earle Goddard’s materials on Californian Athabaskan languages are mostly at Bancroft and the APS, but his Lassik notebooks are at the University of Washington (Melville Jacobs Papers collection; they are apparently marked up with Harry Hoijer’s annotations).
  • J.P. Harrington’s archives are mostly at the National Anthropological Archives, but the Barbareno Chumash materials are in the Santa Barbara Museum of Natural History.

Finally, Justin says the best summary of archival materials for languages of California is in Victor Golla’s recent book.

 

African language corpora

8 Feb

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity, Africa is rivaled only by New Guinea (which probably exceeds it, to be honest).

And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools for English, but (almost) squat for any African language. My hope is this post will be useful for Africanists who are curious about big collections of spoken/written texts. But even more so, I’m hoping that researchers who stick to English and other languages with easy, familiar corpora will take an interest and start playing with the data. Fwiw, people who study African languages are really friendly–so if you’re, say, a computational linguist, it’ll be no problem to find you a buddy.

A number of the resources below should help you find out more about areas/languages outside of Africa, too.

Why study African languages?

Larry Hyman gives some answers here (my favorite part starts on page 14–Leggbo has subject-verb-object word order for affirmative sentences, but subject-object-verb word order in the negative). The Economist gives its answers here (of note: 600 million mobile phone users in Africa–more than Europe or America).

My own answers come from an initial shock of studying Zulu after getting lessons as an anniversary gift. The clicks are wonderful, it’s true. But have you seen the morphology? Zulu is a temptress. Okungumthakathi kuyangikhwifa (‘The damn witch is bewitching me.’)

My more recent work has taken me to Ethiopia–also a beautiful country and packed with very different, very understudied languages. In one village I get to work on three completely unrelated languages (Shabo, Majang, Shekkacho). It’ll be a while before there are any corpora of size on these languages (Shabo has 300 speakers), but Amharic is a major language with a sizable population. It’s Semitic, so if you squint your eyes, you can see a relationship to Arabic and Hebrew, but Amharic is more distant than those two. Cushitic influenced it a lot, so you get subject-object-verb order, for example. You also get ejectives (make a plosive sound like t/k/p/d, there–you breathed through them. Ejectives are like those but just use the air stored above your glottis).

Finding African language corpora

First a request: if you know of corpora based on naturalistic spoken conversations, please let me know.

Kevin Scannell makes several great resources available for people studying less common languages.

It’s worth trucking over to African Language Technology and searching their archives/posting questions. Right off the bat, it has a number of resources.

And it’s also worth checking out all the different resources at OLAC.
For Zulu, check out the Ukwabelana Corpus, prepped by folks doing computational linguistics.

For a treasure-trove of Ndebele/Zulu lexical info–and a huge number of other Bantu languages–take a look at CBOLD (if you have issues, there’s an old version at Berkeley).

ALLEX offers resources for Ndebele, Shona, and Nambya.

Moving away from Bantu, for Amharic, there’s a whole site for resources to help you: http://corpora.amharic.org/. (Speaking of Semitic languages, if you want Arabic, check out my earlier post here).

If you’re interested in Hausa, a good place to start is Uwe Seibert’s Hausa Online.

Also don’t forget that resources like the BBC have written/spoken content in Hausa, Somali, Swahili, Kinyarwanda/Kirundi.

Personally, I’m fascinated by this resource on information structure for 25 sub-Saharan languages.

DoBeS has a number of languages archived, though these tend to be small, highly endangered languages–so if you only want giant amounts of text, this is probably the wrong place. On the other hand, if you want to have a huge impact… EMELD is another place people archive information.

The LDC also has several African-language holdings you might be interested in.

To make your own corpus from web texts, consider CorpusCollie (the example they use is Luo).

Arabic corpora

6 Feb

Before talking about specific Arabic resources, let me suggest some search engines that will be useful for Arabic–and many other languages, too:

If you want to check out which LDC corpora have Arabic, run a search here: http://www.ldc.upenn.edu/Catalog/catalogSearch.jsp.

Another good place to search is the LRE Map (“Language Resources and Evaluation”), which collects info from various NLP conferences. The interface is somehow both simple and confusing. To find Arabic resources, click “Lang” in the top row and then look over to the left-hand panel. Leave “Resource Name” blank and just use the “Resource Language” drop-down menu: http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml

A lot of the corpora you find for Arabic are fairly formal since they come from news reports. I’m going to focus mostly on conversational stuff. First, stuff from the LDC:

  • Egyptian Arabic CALLHOME and CALLFRIEND. The CALLHOME corpus involves 120 half-hour conversations between native Egyptian Arabic speakers. 5-10 minutes of each are transcribed (but you can find various parts of CALLHOME repurposed and transcribed in different ways, for example in the 1997 HUB5 and 2003 NIST collections from the LDC). Note that there is a “supplement” that gives another 20 conversations. The CALLFRIEND corpus involves 60 Egyptian Arabic conversations between people living in the US. I haven’t found as much transcription and processing of it, so probably lean towards CALLHOME.
  • Fisher Levantine Arabic. The Fisher method (the English version is a great resource, too, worth considering as a replacement for Switchboard stuff, btw) is to have strangers call and talk to each other. Here the data is mostly from folks around Jordan. Each pair of strangers talks about a specified topic for a while, so you get interesting topic and demographic information.
  • Gulf Arabic phone calls–975 speakers engaged in spontaneous conversations lasting about six minutes each. There’s also a Levantine version of this and a version of this for Iraqi Arabic. The Levantine and Gulf Arabic corpora are about the same size, the Iraqi Arabic one has about half the number of speakers.
  • 901 phone calls, mostly between Arabic speakers from Lebanon.
  • OntoNotes 4.0 isn’t really conversational, but it is parsed and may be useful.

Now for some stuff outside of the LDC.

COCA: What a fantastic source of data!

3 Feb

Intro

425 million words from 1990-2011.

I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has assembled a bunch of corpora and built an easy-to-use interface so you can make sophisticated queries over vast amounts of data. It’s a lot bigger than most of the corpora you may be using now (CELEX, Switchboard, etc.). And while its sentences aren’t annotated with tree structures, it does have part-of-speech info (and makes it really easy to get collocates).

This post is really about getting started with COCA, but I’ll try to do it in the framework of a particular linguistic phenomenon. But if you get nothing out of this post but USE COCA, that’ll be enough. It also makes it easy to compare to historical American English, the BNC, and Google Books/N-gram (though I won’t be showing that here).

Wh-exclamatives

A few months ago Anna Chernilovskaya came by and presented work she and Rick Nouwen had been doing on exclamatives like:

(1) What a beautiful song John wrote!

Their work is set against Rett (2011), which analyzes a wh-exclamative as a speaker expressing that something is noteworthy in the given context–in other words, to say (1) you think there’s something noteworthy about John’s song relative to the standard beauty of songs.

If you don’t have a degree adjective like “beautiful” (instead something like What a song John wrote!), Rett says you have an operator that acts like a silent adjective, so you’re exclaiming about beauty, weirdness, or complexity. Chernilovskaya and Nouwen are saying, “Nah, it’s simpler–it’s just direct noteworthiness. Drop all this degree stuff.”

My question: how are wh-exclamatives actually used by English speakers? My intuition is with C&N that it’s just about noteworthiness. But is that all?

First steps with COCA

Go to http://corpus.byu.edu/coca/, look over to the upper right–you can log in or register, as appropriate.

Most of the action is going to be in the panel on the left. Some of the stuff is hidden to reduce complexity, so if you want to see part-of-speech tags, just click on the text that says “POS LIST” and you’ll get a drop-down menu you can choose stuff from. If you want to do collocation stuff, just click “COLLOCATES”, etc.

COCA's left-pane

Let’s go ahead and start collecting examples of wh-exclamatives. In the “WORD(S)” text box, we could type:

what [at*]

And that would get us 82,074 sentences, including both what a and what an. But it would also get us what the, what every, and what no–including sentences like Her would-be opponents are pondering these questions and what the answers mean for their own possible candidacies.

That’s rather too much. Let’s try the following–the [y*] means “any punctuation”.

[y*] what [at*]

Now we have 18,063 sentences. This may help us see that “I mean, what the heck” is interesting, but perhaps it’s still taking us too far afield. Let’s just do what you are probably thinking we should’ve done from the beginning:

[y*] what a|an

8,919 results. Looking through the results, this is a pretty good query. We could restrict which punctuation we care about, but if you go ahead and do this query yourself, I think you’ll see why we want to keep most punctuation.

Search results in COCA (click to make 'em bigger)

The top-right box shows all the matches lumped together by punctuation/article; click on any of them and the actual sentences will show up below. Notice the drop-down Help box on the far right, above the sentences. Use that to find out more about query syntax.

Noteworthiness

Let’s see what the most common words are that go with a what a|an construction. We have a few options; we can run queries like:

what a|an [j*] [n*]
what a|an [j*]
what a|an [n*]

This third one is pretty interesting because it helps us see which sorts of exclamations are made without an adjective. (One person in our semantics/pragmatics group confided that his father always said What a baby! when confronted by an unattractive child–there’s pressure to exclaim something, but he doesn’t want to lie, so this strategy gives the right form without quite committing to the meaning the parents might take away.)

Here are the top 15, though the first one ends up not really being part of our pattern a lot of the time.

1  WHAT A LOT 507
2  WHAT A DIFFERENCE 321
3  WHAT A WASTE 187
4  WHAT A MAN 183
5  WHAT A SHAME 168
6  WHAT A MESS 159
7  WHAT A PERSON 156
8  WHAT A RELIEF 152
9  WHAT A DAY 150
10  WHAT A SURPRISE 142
11  WHAT A WOMAN 126
12  WHAT A PLEASURE 117
13  WHAT A WAY 104
14  WHAT A STORY 101
15  WHAT A PITY 95

And in case you were curious about the one-word exclamatives with actual exclamation marks:

what a|an [n*] !
1  WHAT A MESS ! 30
2  WHAT A SURPRISE ! 27
3  WHAT A RELIEF ! 26
4  WHAT A SIGHT ! 23
5  WHAT A DAY ! 22
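Incidentally, if you want to run a rough version of this kind of pattern count over your own plain text, a few lines of Python will approximate it. This is strictly a sketch: no POS tagging, so it can’t enforce the [n*] restriction, and embedded free relatives (know what a deadline is) match too. The toy text is made up.

```python
import re
from collections import Counter

# A rough, POS-free stand-in for COCA's "what a|an [n*]" query:
# grab the word right after "what a/an" and tally it up.
# Without tagging, adjectives and free relatives sneak in too.
PATTERN = re.compile(r"\bwhat\s+an?\s+(\w+)", re.IGNORECASE)

def what_a_counts(text):
    return Counter(w.lower() for w in PATTERN.findall(text))

text = ("What a mess! What a day. What a relief that was. "
        "What a mess you've made. I know what a deadline is.")
counts = what_a_counts(text)  # mess: 2, day: 1, relief: 1, deadline: 1
```

COCA’s real query is doing much more (tagging, lemmatization, genre restriction), so treat this as a toy for texts you can’t load into COCA.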

One of the things C&N say Rett can’t handle is something like What an extremely nice man since the extremely and nice should interfere on Rett’s account. You can’t say *John is more extremely nice than Bill or *John is too extremely nice. How does this pattern work in the data?

what a|an [r*]
1  WHAT A VERY 24
2  WHAT A TRULY 13
3  WHAT A REALLY 12
4  WHAT A GREAT 10
5  WHAT AN ABSOLUTELY 9
6  WHAT AN INCREDIBLY 9
7  WHAT A WONDERFULLY 7

Now another way of looking at stuff is to look for collocational strength.

Getting collocation info

The important stuff here: I clicked on “COLLOCATES”, put in the part of speech (adverb=[r*]), and chose the window I was looking at–in this case two to the right. I also set the “MINIMUM” to be based on mutual information and told it to ignore things with a mutual information of less than 2 (a standard strength threshold is 3.0, but I wanted to get a few more than that).
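If you’ve never computed mutual information yourself: what COCA is reporting is, at its core, pointwise mutual information–how much more often the node word and a collocate show up together than their separate frequencies would predict. COCA’s exact formula also folds in the size of the collocation window; this sketch, with invented counts, just shows the core idea:

```python
from math import log2

# Pointwise mutual information between a node word and a collocate:
# log2( P(node, collocate) / (P(node) * P(collocate)) ).
# COCA's reported MI also adjusts for the collocation window size;
# the counts here are invented for illustration.
def pmi(pair_count, node_count, colloc_count, corpus_size):
    p_pair = pair_count / corpus_size
    p_node = node_count / corpus_size
    p_colloc = colloc_count / corpus_size
    return log2(p_pair / (p_node * p_colloc))

# A collocate that's rare overall but common next to the node word scores high:
score = pmi(100, 8_919, 10_000, 425_000_000)  # roughly 8.9
```

This is why frequent-but-bland collocates like good end up with lower MI than rarer ones like wonderfully, even when the raw counts favor good.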

A few other things:

  • You may want to restrict yourself to just SPOKEN stuff (that’s in the middle of the left-pane).
  • If you have a big query you probably want to change # HITS FREQ to something big (the default is 100).
  • Often it’s more useful to GROUP BY lemmas than words (it doesn’t matter here, but it would if I were doing something with verbs)
  • If you choose SAVE LISTS, you’ll get prompted to enter a list name ABOVE the top results. It’s really easy to miss.

But back to the results. The adverbs with the highest mutual information are truly, incredibly, wonderfully, extraordinarily, and remarkably, though the absolute counts are pretty low. Still, clicking around on examples may help.

Now if we do adjectives, we get these results:

Num  Adj  CountTogether  AdjTotal  PctOfAdj  MI
1  [GREAT] 428    248,858 0.17 5.4
2  [WONDERFUL] 263      29,277 0.9 7.78
3  [BEAUTIFUL] 161      43,750 0.37 6.5
4  [GOOD] 133    409,451 0.03 2.99
5  [LOVELY] 102      10,246 1 7.93
6  [NICE] 100      50,448 0.2 5.6
7  [TERRIBLE] 84      20,290 0.41 6.67
8  [STRANGE] 78      26,432 0.3 6.18
9  [AMAZING] 60      17,204 0.35 6.42
10  [STUPID] 47      13,524 0.35 6.41

Notice how exclamatives skew positive. (That’s why the What a baby! trick works!)

And nouns, though let’s increase the window to 4 to the right.

Num  Noun  CountTogether  NounTotal  PctOfNoun  MI
1  [THING] 264    438,956 0.06 2.88
2  [DAY] 242    486,452 0.05 2.61
3  [IDEA] 209    133,349 0.16 4.27
4  [WAY] 198    521,448 0.04 2.22
5  [DIFFERENCE] 195      89,269 0.22 4.74
6  [WASTE] 166      31,419 0.53 6.02
7  [SURPRISE] 165      35,267 0.47 5.84
8  [MAN] 153    460,880 0.03 2.03
9  [SHAME] 151        9,431 1.6 7.62
10  [STORY] 148    178,875 0.08 3.34

Noteworthiness?

Here is C&N’s definition of noteworthiness:

an entity is noteworthy iff its intrinsic characteristics (i.e. those characteristics that are independent of the factual situation) stand out considerably with respect to a comparison class of entities (C&N 2012: 5).

In the last section, we saw that what a (adj) story was a prominent use (#10). If we restrict ourselves just to the spoken portion of the corpus, it leaps up to #1. That’s because the spoken portion comes from talk shows and news programs (like Good Morning America, Dateline, and Larry King). If you look at the transcripts (or have ever listened to American news), you’ll know that what a (great/emotional/amazing/astonishing/inspiring) story usually comes up after a story is done and a segue is happening. The same goes for what a pleasure and many of the other items. What a is used in these talk show/news programs as a way of simultaneously evaluating and moving between topics (usually out of them, but sometimes into them).

All of this makes me skeptical about the definition C&N provide.

Consider this Good Morning America clip from last August (fast-forward to about 56:20), where two stories, back to back, are described in terms of “what a” noteworthiness:

{Story about a woman surviving in the wilderness for 3 days}

Thank you, David.

What a story.

And what a story we have coming up for you.

{Uh, that’s about the making of a boy band.}

I would contend that these stories are not really that noteworthy (they also occur at the tail end of the show and so may be the most cuttable if things earlier had gone long). You may or may not agree with me. But at a minimum we probably need to say that such exclamatives are claims about noteworthiness, not factual observations about things that are intrinsically noteworthy. Any sort of judgment about noteworthiness has to have a judge, so that seems to be a problem for arguments about intrinsic qualities.

Part of Rett’s discussion does have the speaker in the mix (pages 4-5), but then towards the end of the paper she says of

(2) How very unexpected John’s news is!

(3) What a surprise John’s height is!

“To the extent that they sound natural, [they] are interpreted as reflecting an objective surprise or unexpectedness rather than one oriented to the speaker” (Rett 2011: 19). Her main point is that gradable properties get their values from context–it’s not that they reflect the speaker’s attitude.

I’m a big fan of information-theoretic accounts of language, which give measures of surprise. The surprise of “x” and “y” co-occurring is based on the prior probabilities of their occurring separately and together. But the truth is that those probabilities are always defined against some perceiver’s experience. Psycholinguists use corpora to estimate how surprising word y is following word x, but if some subject had a remarkably different experience with “x” and “y” than most of us, well, we’d expect the effects to be different.
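To make that concrete, here’s the standard surprisal calculation, sketched in Python with toy counts. The point is that the probability estimate is only as “objective” as the corpus–that is, the experience–it comes from:

```python
from math import log2

# Surprisal of word y given preceding word x: -log2 P(y | x),
# with P(y | x) estimated as count(x y) / count(x) from some corpus.
# Whose corpus? That's exactly where the perceiver sneaks back in.
def surprisal(bigram_count, x_count):
    return -log2(bigram_count / x_count)

# Toy counts: "what" occurs 1000 times, "what a" 125 of those times.
s = surprisal(125, 1000)  # -log2(0.125) = 3.0 bits
```

Swap in counts from a different speaker’s experience and the same bigram gets a different surprisal–which is the argument against treating surprise as purely objective.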

After looking through the actual uses of what a, I propose you HAVE to build in the speaker. And what’s more, these what a sentences are doing more than expressing an observation about the world. More than expressing an internal state. And more than just an evaluation. They are social in nature (what a (stupid/amazing/dumb) thing to say), so I would contend that theories should also look at the consequences of use for the relationship between the speaker and their audience. My inclination is also to believe we ought to say something about how these exclamatives skew positive in their adjectival collocates, and how they probably keep that positive skew as a default interpretation even when there’s no adjective, as in the what a baby! example.

But what a long post this is. I’ll stop.