23 Feb

Ooh, check out the Cornell Movie Dialogs Corpus that Lillian Lee and Cristian Danescu-Niculescu-Mizil have made available! (Here’s their work on accommodation/priming/engagement, including their rationale for using a movie corpus.)

The corpus features conversations between over 9,000 characters in 617 movies. D-N-M&L have marked it up with a lot of interesting information: the gender of who’s talking, what position they are in the film credits, what genres and ratings the movie gets in IMDB, etc. In this post I’m going to look at like and I mean.

The distribution of like

We really want to focus on “discourse like” (as in, That’s, like, awesome). To get rid of examples like the one in this sentence here or in I like this corpus, I restrict myself to examples of like that have a comma after the like (using a comma-before gets too many “I should’ve married some in the family, like you” matches). There are 346 lines that match.
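If you want to try this filter yourself, here is a minimal sketch of the comma-after heuristic in Python. The regular expression and the sample lines are my own assumptions for illustration, not the exact query behind the 346 figure.

```python
import re

# Assumption: "discourse like" is approximated as the word "like"
# immediately followed by a comma. This pattern is illustrative only.
DISCOURSE_LIKE = re.compile(r"\blike,", re.IGNORECASE)

def has_discourse_like(line: str) -> bool:
    """Return True if the line contains 'like' directly followed by a comma."""
    return DISCOURSE_LIKE.search(line) is not None

# Hypothetical sample lines based on the examples in the text.
sample_lines = [
    "That's, like, awesome.",
    "I like this corpus.",
    "I should've married some in the family, like you.",
]

matches = [line for line in sample_lines if has_discourse_like(line)]
```

Only the first sample line survives the filter. The comma-before example is skipped because its like is followed by a space rather than a comma.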

The first thing you’re going to guess is that this is going to occur a lot in comedies–and you’re right. It occurs in comedies 1.64 times more often than we’d expect if it were just distributed across genres by chance. Maybe it’ll surprise you more to know that it also occurs a lot in “Crime” genre films. Discourse like does NOT like to occur in action/adventures or mysteries, though.
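A figure like "1.64 times more often than we'd expect" can be read as an observed-over-expected ratio. Here is a rough sketch of that arithmetic, assuming the expected count for a genre is just the total number of matches scaled by that genre's share of all lines. The function name and the toy numbers are made up for illustration, and a movie with several genres would count toward each of them, which this sketch ignores.

```python
def over_representation(hits_in_group: int, lines_in_group: int,
                        total_hits: int, total_lines: int) -> float:
    """Ratio of observed hits to the hits expected by chance.

    Expected hits = total_hits * (lines_in_group / total_lines),
    i.e. what we'd see if matches were spread evenly over all lines.
    """
    expected = total_hits * lines_in_group / total_lines
    return hits_in_group / expected

# Toy numbers (NOT from the corpus): a genre holds 250 of 1000 lines
# and 20 of the 40 matching lines.
ratio = over_representation(hits_in_group=20, lines_in_group=250,
                            total_hits=40, total_lines=1000)
```

With these toy numbers the genre is expected to have 10 matches and has 20, so the ratio is 2.0. A ratio under 1 means the form shows up less often than chance predicts.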

You’re probably also going to guess that female characters use it more than male characters–and you’re right. In fact, it’s when a female character is talking to another female character that they use it the most. But the counts are kind of low here since the corpus is not completely gender-annotated.

I didn’t really have a guess about whether protagonists would be using it more than minor characters. Just taking “is a character higher up in the credits talking to a character lower down in the credits”, we see that the more “important” a character is, the MORE they use discourse like. The effect is especially strong if the person talking is first or second in the credits and they’re talking to someone who appears fifth or lower.

Note that using position as a measure of character importance is a little tricky. For example, Melissa McCarthy is nominated for an Oscar this year for Best Supporting Actress in Bridesmaids–but she’s listed 16th in the credits. And Gary Oldman is up for Best Actor for Tinker Tailor Soldier Spy and he’s actually 7th in the credits there. But these are outliers. Mostly, the characters with the most screen time and the biggest bang are higher up in the credits (this is confounded with the fact that actors/actresses have something to do with the credit-rolls, too).

The use of I mean

There are 2,353 I mean’s in the corpus.

The movies put the most I mean’s in the mouths of female characters–they use it at about the same rate whether they’re talking to men or women. Male characters speaking to female characters also use a fair amount of I mean–which really means that the “odd ball” group is the males-speaking-to-males. They’re the only group that’s constrained against using I mean (compared to what would’ve happened at chance).

In terms of genre, it’s in romances, comedies, and dramas that you get the most I mean’s and in thrillers where you get the fewest.

Comparing the interlocutors’ positions in the credits, there’s not as much of a hierarchy thing happening with I mean. One thing that is strange is that while characters who are first in the credits use about as much I mean as we’d predict (based on their overall line counts and the overall percentage of lines-anyone-has with I mean), the characters in the second position are using A LOT of I mean.

When we look for interactions between credit-position and genre, we generally see that these characters do the same thing. That is, neither the 1st nor the 2nd person in the credits of a thriller is using much I mean. But both 1st and 2nd positions are using a lot of I mean in dramas.

They part ways in comedies and sci-fi. In comedies, the 1st position uses a lot of I mean, while the 2nd credited character uses very little. My sense is that I mean is a great resource in comedies for someone who has to explain themselves a lot and that’s the main protagonist in a comedy–they’re the ones who are put in spots that require clarification:

JUNO: My dad went through this phase where he was obsessed with Greek and Roman mythology. He named me after Zeus’s wife. I mean, Zeus had other lays, but I’m pretty sure Juno was his only wife. She was supposed to be really beautiful but really mean. Like Diana Ross.

In sci-fi, it’s reversed. The 1st-credited characters are very restricted from using I mean, while the 2nd-credited characters use it A LOT. This has something to do with explaining yourself again, but genre conventions are different. The hero of a sci-fi movie doesn’t do a lot of I mean’s.

You don’t get Ripley from Aliens talking about I mean (she has one example, but it’s That’s not what I mean). By contrast, “Newt”, the young girl who is the colony survivor (and who is #2 in the credits), says:

NEWT: Isn’t that how babies come? I mean people babies…they grow inside you?

Hm. I guess I’m going to stop now…with a really creepy line if you know anything about its context.

[Update 2/29/2012: After I published this article, I started to wonder whether the quote I gave for Newt really counts as a good example of discourse “I mean”. I think the findings are still true, but this super-cool example may not be as super-cool as I initially thought. What do you think?]


