Tag Archives: english

Tweet parser and word clusters

Brendan O’Connor & Co. from CMU have updated their tweet parser and provided a bunch of other stuff, including a collection of 56 million English-language tweets.

http://www.ark.cs.cmu.edu/TweetNLP/

They’ve also done some clustering work on the words. Some of their clusters make a lot of sense immediately:

haven’t havent shoulda would’ve should’ve hadn’t woulda could’ve coulda havnt shouldve wouldve must’ve musta couldve haven’t havn’t hadnt might’ve hvnt mustve shuda wuda shudda wudda shulda wulda mighta cudda have’nt wudve shudve hvent #glocalurban hadn’t haven`t mightve shlda haven´t culda should’ve wlda avnt would’ve hvn’t may’ve cudve shldve have’t could’ve

Others are intriguing. For example, I believe gaydar may be an actual body part, given its cluster:

body brain soul skin stomach throat belly tummy ego imagination gut liver jaw spine bladder handwriting scalp body’s subconscious uterus complexion stomache eyesight navel torso palate bodys demeanor physique waistline clitoris abdomen spleen gaydar gallbladder pocketbook bdy bodyy tummy’s tailbone ringback ribcage cervix skinn throat’s escentuals skin’s sternum ellum cell’s

Btw, look at all the ways to put lol in the past tense!

looked felt laughed yelled tasted screamed smiled smelled acted shouted stared waved lol’d smelt bitched giggled winked loled lookd behaves glanced chuckled honked barked moaned growled peeked blushed beeped lol’ed squealed gasped hollered cringed whistled whined glared lold grinned smirked hissed snored lolled holla’d lol-ed laffed meowed stuttered groaned flinched

The clusters in HTML: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html
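If you'd rather poke at the clusters in code than in the HTML viewer, here is a minimal sketch. It assumes a plain-text cluster file with one word per line in the form cluster-path, word, count, separated by tabs. That layout is my assumption, so check the actual download first. The paths and counts in `sample` are invented for illustration.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group words by cluster path, most frequent word first.

    Assumes each line looks like "path<TAB>word<TAB>count".
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        path, word, count = line.split("\t")
        clusters[path].append((word, int(count)))
    for words in clusters.values():
        words.sort(key=lambda wc: wc[1], reverse=True)
    return dict(clusters)


# Hypothetical sample lines (made-up paths and counts).
sample = [
    "0110\tlol'd\t120",
    "0110\tloled\t45",
    "0110\tlaughed\t900",
    "1011\tbody\t5000",
    "1011\tgaydar\t30",
]

clusters = group_by_cluster(sample)
for path, words in clusters.items():
    print(path, " ".join(word for word, _ in words))
```

With a real file you would pass an open file handle instead of `sample`.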

Tags: english, social media, Twitter


Power (Supreme Court Justices and Wikipedia editors)!

19 Sep

Cristian Danescu-Niculescu-Mizil, Timothy Hawes, and colleagues have released some more corpora that are worth playing with.

The Wikipedia Talk Page Conversations Corpus: 125,000 conversations involving about 30,000 editors. Metadata such as editor’s status, time of status change and gender is included.
Supreme Court Dialogs Corpus: oral arguments making up 51,498 utterances (50,389 conversational exchanges); 204 cases with 11 justices and 311 other participants (lawyers, for example). You get case outcome, justice vote, gender annotation, etc.

These corpora support really interesting work about how accommodation and power go together.

One of my interests is how the word little gets used in terms of power relationships (see also my dissertation). I did a quick look through the Supreme Court. (Very very quick, so I make no conclusions, here, just report some numbers.)

Justices Breyer and Ginsburg both talk a lot during oral arguments (their speech represents 15.51% and 11.37%, respectively, of all the justices’ speech in the corpus). There are 232 turns involving the word “little”. That’s not quite as many as I’d like for making really strong claims. But these justices use the word at very different rates: Breyer uses it 71 times–nearly twice as often as we’d expect if little were distributed across the justices based on their total number of turns. Ginsburg, on the other hand, uses it only 10 times; that’s 38% of what we might have expected.
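Here is a rough sketch of the arithmetic behind "twice as often" and "38%". It assumes the baseline is simply each justice's share of justice speech multiplied by the 232 "little" turns, which is a simplification on my part; the function and variable names are just for illustration.

```python
# Figures quoted in the post; the baseline itself is an assumption.
TOTAL_LITTLE_TURNS = 232

justices = {
    # name: (share of all justice speech, observed uses of "little")
    "Breyer": (0.1551, 71),
    "Ginsburg": (0.1137, 10),
}


def observed_over_expected(share, observed, total=TOTAL_LITTLE_TURNS):
    """Ratio of observed uses to the count expected from speech share alone."""
    expected = share * total
    return observed / expected


ratios = {
    name: observed_over_expected(share, observed)
    for name, (share, observed) in justices.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
# Breyer: 1.97
# Ginsburg: 0.38
```

A ratio near 1 would mean a justice uses the word about as often as their share of the talking predicts.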

I’m not going to offer any analysis here, but I do want to give some examples of the work that little does–it is sometimes dismissive, sometimes hedging (it is especially likely to hedge states that the justices are claiming for themselves, it seems). Here are a few examples from Breyer:

“they have a little paragraph of explanation”
“now, that’s a little tough”
“you put a little thing in the corner”
“I’m a little nervous about it”

Other justices talk about being a little puzzled, confused, etc. But notice that Ginsburg never pairs little with any kind of mental state. The very closest she comes is something pretty far off (“I” is not in the utterance): “It’s — given that we’re dealing with sophisticated judges, the same panel in both episodes, it’s a little hard to — to see where the due process violation is.”

Btw, the string “I mean,” is used 2,020 times. Justices and non-justices speak roughly the same amount of time, but the justices are the ones using “I mean,” much more (in terms of utterances that contain the phrase, justices use it 1.4 times more often than we’d expect them to, and 2.6 times more often than the non-justices who talk). Wanna know who is the biggest “I mean,”‘er? Have a guess?

Tags: english, law, power, Wikipedia


Super-European language translation corpus

15 Apr

The DGT-TM corpus takes sentences from 22 European languages and has translations (manually produced) for 21 other languages. That means there are about 3 million sentences per language.

With all pairs among these 22 languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish

Yes–you got that right: it’s all European, but it’s not *just* Indo-European.
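To get a feel for what "all pairs" means, here is a quick count using the 22 languages listed above. The only assumption is how you count: unordered pairs treat English-French and French-English as the same pair, while directed pairs count each translation direction separately.

```python
from itertools import combinations, permutations

languages = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: {English, French} counted once.
unordered_pairs = list(combinations(languages, 2))

# Directed pairs: English->French and French->English counted separately.
directed_pairs = list(permutations(languages, 2))

print(len(languages))        # 22
print(len(unordered_pairs))  # 231
print(len(directed_pairs))   # 462
```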

http://langtech.jrc.ec.europa.eu/DGT-TM.html

It’s triple the size that it was in 2007.

The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)

Paper: Steinberger, Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter (2012).

Tags: Bulgarian, Czech, Danish, Dutch, english, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish, translation


Verb phrase ellipsis corpus

14 Apr

One of the hot topics in syntax dissertations in recent years has been ellipsis (I’m not kidding–when I was on a job search committee a few years ago, ellipsis was *everywhere*).

Bos and Spenader (2011) have an annotated corpus!

Bos, Johan and Jennifer Spenader. 2011. An annotated corpus for the analysis of VP ellipsis. Language Resources and Evaluation 45(4): 463–494.

Tags: ellipsis, english


Names

11 Apr

Stephanie Shih has been doing really fun work on what makes a name (first and last) using a corpus of Facebook names. This helps her get recent trends–the Social Security Administration releases first names all the time, but it doesn’t release first+last until 100 years after the birth certificates come in.

Stephanie’s talk at the LSA

Amaç Herdagdelen has compiled Census data from 1990 and put it with the Social Security Administration’s statistics for popular baby names for every year between 1960 and 2010:

Data, code: https://github.com/amacinho/Name-Gender-Guesser
- Orwant and Daly’s older Perl module has fuzzy search capabilities (phonetic similarity of names): http://search.cpan.org/~edaly/Text-GenderFromName-0.32/GenderFromName.pm
Paper: http://clic.cimec.unitn.it/amac/twitter_ngram/Herdagdelen2012-RTC-draft.pdf

Tags: America, census, english, gender, geography, names, phonology, phonotactics, social security


World Englishes

9 Apr

Elizabeth Traugott offers these suggestions for corpora on world Englishes:

eWAVE = The electronic World Atlas of Varieties of English. 2011. Edited by Bernd Kortmann and Kerstin Lunkenheimer. Leipzig: Max Planck Institute for Evolutionary Anthropology. http://www.ewave-atlas.org/.
ICE = The International Corpus of English, version 2. 2006. Coordinated by Gerald Nelson (University of Hong Kong). http://ice-corpora.net/ice/index.htm.
ONZE = Origins of New Zealand English Corpus. In progress. Compiled by the ONZE project team. University of Canterbury. http://www.lacl.canterbury.ac.nz/onze/index.html.
SAVE = The South Asian Varieties of English Corpus. 2011. Compiled by Joybrato Mukherjee, Tobias Bernaisch, Christopher Koch, and Marco Schilk. University of Giessen. http://www.uni-giessen.de/cms/faculties/f05/engl/ling/research/save.

Tags: Aboriginal E, Acrolectal Fiji E, alternations, Appalachian E, Australian E, Australian Vernacular E, Bahamian C, Bahamian E, Bangladesh, Barbadian C (Bajan), Belizean C, Bislama, Black South African E, British Creole, Butler E, Cameroon E, Cameroon Pidgin, Canada, Channel Islands E, Chicano E, Colloquial American E, Colloquial Fiji E, Colloquial Singapore E, dative, dialects, ditransitive, Earlier African American Vernacular E, East Africa, East Anglia, Eastern Maroon C, english, Falkland Islands E, Ghanaian E, Ghanaian Pidgin, Great Britain, Gullah, Guyanese C, Hawaiian C, Hong Kong, Hong Kong E, India, Indian E, Indian South African E, Ireland, Irish E, Jamaica, Jamaican C, Jamaican E, Kenyan E, Krio, Liberian Settler E, Malaysian E, Maldives, Manx E, Nepal, New Zealand, New Zealand E, Newfoundland E, Nigerian E, Nigerian Pidgin, Norf'k, North of England, Orkney and Shetland E, Ozark E, Pakistan, Pakistan E, Palmerston E, Philippines, Roper River C (Kriol), Rural African American Vernacular E, San Andrés C, Saramaccan, Scottish E, SE of England, Singapore, South Asia, Southeast American Enclave dialects, Sranan, Sri Lanka, Sri Lanka E, St. Helena E, SW of England, Tanzanian E, Tok Pisin, Torres Strait C, Trinidadian C, Tristan da Cunha E, Ugandan E, Urban African American Vernacular E, USA, Vernacular Liberian E, Vincentian C, Welsh E, White South African E, White Zimbabwean E, world, [Maltese E]


Erorrs erorrs evrerywehere

7 Apr

Earlier I showed a neat resource that has grammatical and ungrammatical sentences from various linguistics papers over the years. But what if you want a whole bunch of English errors?

In that case, check out work that Robert Dale and Adam Kilgarriff are putting together: http://clt.mq.edu.au/research/projects/hoo/.

You might also look around for ESL corpora, for example:

CK Jung’s lab has a one-million-word written corpus of Korean learners of English (“YELC”): http://www.uclouvain.be/en-cecl-lcworld.html.
Izumi, Uchimoto, and Isahara (2004) have a Japanese Learner English Corpus, too (NICT JLE), but I think you have to email them.

Tags: english, errors, ESL, Japanese, Korean, ungrammatical


Making a corpus from YouTube: dialects in North America

11 Mar

Here is a link to Rick Aschmann’s amazing collection of speech clips from Canadian and American speakers on YouTube using the Atlas of North American English as a starting point:

http://aschmann.net/AmEng/#SmallMapUnitedStates

Aschmann’s work is a great example of how to use YouTube–you should also be aware that YouTube allows users to subtitle and to caption clips, which means that you can potentially find words and expressions in particular languages AND/OR translations into other languages.

You may want to get all the YouTube stuff into wave forms that you can analyze. Here are my instructions for how to get YouTube clips into Praat. Basically, you’ll capture the video and then convert the video into audio. Note that YouTube does involve compression, so it’s not the same as a lossless recording. That may or may not be important depending upon the phenomena you’re studying:

http://www.stanford.edu/~tylers/notes/socioling/How_to_get_YouTube_into_Praat.pdf
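If you already have a clip saved locally, the video-to-audio step can also be scripted. This is only a sketch and not the method in the PDF above: it assumes the ffmpeg command-line tool is installed, and `clip.mp4` is a hypothetical file name. It writes a mono WAV file, which Praat can open.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, wav_path, sample_rate=44100):
    """Build an ffmpeg call that drops the video stream and writes mono WAV."""
    return [
        "ffmpeg",
        "-i", str(video_path),    # input video file
        "-vn",                    # no video in the output
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # audio sample rate in Hz
        str(wav_path),
    ]


def video_to_wav(video_path, sample_rate=44100):
    """Convert a local video file to a WAV file next to it (requires ffmpeg)."""
    video_path = Path(video_path)
    wav_path = video_path.with_suffix(".wav")
    subprocess.run(
        build_ffmpeg_command(video_path, wav_path, sample_rate), check=True
    )
    return wav_path


# Show the command without running it; "clip.mp4" is a made-up file name.
command = build_ffmpeg_command("clip.mp4", "clip.wav")
print(" ".join(command))
```

Converting to WAV does not undo YouTube's compression; it only puts the audio in a format Praat reads.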

Tags: American English, Atlas of North American English, audio, Canadian English, convert, dialects, download, english, how-to, Praat, sociolinguistics, video, YouTube


Prosodically annotated corpora

8 Mar

Here’s a summary of corpora to check out if you’re interested in prosody. It’s really English-heavy. Send me ideas for non-English sources that are annotated!

For ToBI marked stuff:

The Boston University Radio Speech Corpus will get you student hosts reading the news. The transcripts are marked up with prosodic information (ToBI) for about 3.5 hours worth of data. One nice thing is that it has inter-rater reliability information on the prosodic annotations (see Hasegawa-Johnson et al., 2005 for more about that and an example of research using the corpus).
There’s also ToBI annotation for 75 Switchboard conversations in the NXT edition: http://groups.inf.ed.ac.uk/switchboard/

Other annotation systems:

You might check out the Santa Barbara Corpus, which is free now and is a great source for prosody research since it’s naturalistic and has a lot of different kinds of people talking in a lot of different situations. I’m not sure if anyone has ever annotated it with ToBI but the transcripts themselves have a host of prosodic cues.
The London-Lund Corpus has a lot of prosodic annotation, too.
The Hong Kong Corpus of Spoken English is naturalistic in that it’s all from real-life stuff (interviews, presentations, etc). You can get a flavor of it here but to get all the prosodic information, you need to get the book, here. It uses David Brazil’s Discourse Intonation system (prominence, tone, key, termination).
There’s also the Aix-MARSEC database, which is five hours of spoken British English with phonemes, syllables, syllable constituents, rhythm units, stress feet, words, and intonation units all marked up. (Get the data here, ready for Praat.)
The Wellington Corpus of Spoken New Zealand English has New Zealand English with emphatic stress marked.
The IViE corpus is labeled prosodically, too.

More of a stretch is the Audiovisual Database of Spoken American English. I don’t think most of you interested in prosody will care about this corpus, but I include it just in case.

Finally, in the universe of emotion and prosody, you can try out:

(See my previous posts on emotion here and here for other resources–note that the two above are both “acted”.)

Tags: english, Mandarin, prosody, stress, tobi


Like, let’s go to the movies, I mean…

23 Feb

Ooh, check out the Cornell Movie Dialogs Corpus that Lillian Lee and Cristian Danescu-Niculescu-Mizil have made available! (Here’s their work on accommodation/priming/engagement, including their rationale for using a movie corpus.)

The corpus features conversations between over 9,000 characters in 617 movies. D-N-M&L have marked it up with a lot of interesting information: the gender of who’s talking, what position they are in the film credits, what genres and ratings the movie gets in IMDB, etc. In this post I’m going to look at like and I mean.

The distribution of like

We really want to focus on “discourse like” (as in, That’s, like, awesome). To get rid of examples like the one in this sentence here or in I like this corpus, I restrict myself to examples of like that have a comma after the like (using a comma-before gets too many “I should’ve married some in the family, like you” matches). There are 346 lines that match.
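The comma-after filter can be sketched as a regular expression. The pattern below (whole word like, immediately followed by a comma, ignoring case) is my guess at a reasonable implementation, not necessarily the exact search used for the 346 matches. Some of the sample lines come from this post and the rest are made up.

```python
import re

# "like" as a whole word, immediately followed by a comma.
# Case-insensitive matching is an assumption.
DISCOURSE_LIKE = re.compile(r"\blike,", re.IGNORECASE)

sample_lines = [
    "That's, like, awesome.",                              # discourse like
    "I like this corpus.",                                 # verb, no comma
    "I should've married some in the family, like you.",   # comma before only
    "Like, what are you even doing?",                      # made-up example
    "It looks businesslike, I guess.",                     # made-up: not a whole word
]

matches = [line for line in sample_lines if DISCOURSE_LIKE.search(line)]

for line in matches:
    print(line)
```

Only the first and fourth lines match: the word boundary keeps businesslike out, and the comma-before case is skipped because nothing follows like but a space.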

The first thing you’re going to guess is that this is going to occur a lot in comedies–and you’re right. It occurs in comedies 1.64 times more often than we’d expect if it were just distributed across genres by chance. Maybe it’ll surprise you more to know that it also occurs a lot in “Crime” genre films. Discourse like does NOT like to occur in action/adventures or mysteries, though.

You’re probably also going to guess that female characters use it more than male characters–and you’re right. In fact, it’s when a female character is talking to another female character that they use it the most. But the counts are kind of low here since the corpus is not completely gender-annotated.

I didn’t really have a guess about whether protagonists would be using it more than minor characters. Just taking “is a character higher up in the credits talking to a character lower down in the credits”, we see that the more “important” a character is, the MORE they use discourse like. The effect is especially strong if the person talking is first or second in the credits and they’re talking to someone who appears fifth or lower.
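The credit-position comparison boils down to bucketing each exchange by who outranks whom. Here is a small sketch of that bucketing. It assumes credit positions are integers where 1 is top billing; the function names and the example exchanges are hypothetical.

```python
from collections import Counter


def credit_relation(speaker_pos, addressee_pos):
    """Compare credit positions; a smaller number means higher in the credits."""
    if speaker_pos < addressee_pos:
        return "speaker higher"
    if speaker_pos > addressee_pos:
        return "speaker lower"
    return "same"


def is_lead_to_minor(speaker_pos, addressee_pos):
    """True when a 1st/2nd-billed character talks to someone 5th or lower."""
    return speaker_pos <= 2 and addressee_pos >= 5


# Hypothetical exchanges: (speaker position, addressee position, has "like,")
exchanges = [
    (1, 6, True),
    (2, 1, False),
    (7, 1, False),
    (1, 5, True),
    (4, 9, False),
]

tally = Counter()
for speaker_pos, addressee_pos, has_like in exchanges:
    tally[(credit_relation(speaker_pos, addressee_pos), has_like)] += 1

lead_to_minor = [e for e in exchanges if is_lead_to_minor(e[0], e[1])]

print(tally)
print(len(lead_to_minor))
```

From tallies like these you can compare how often each bucket has a discourse like against how many lines it has overall.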

Note that using position as a measure of character importance is a little tricky. For example, Melissa McCarthy is nominated for an Oscar this year for Best Supporting Actress in Bridesmaids–but she’s listed 16th in the credits. And Gary Oldman is up for Best Actor for Tinker, Tailor, Soldier, Spy and he’s actually 7th in the credits there. But these are outliers. Mostly, the characters with the most screen time and the biggest bang are higher up in the credits (this is confounded with the fact that actors/actresses have something to do with the credit-rolls, too).

The use of I mean

There are 2,353 I mean‘s in the corpus.

The movies put the most I mean‘s in the mouths of female characters–they use it at about the same rate whether they’re talking to men or women. Male characters speaking to female characters also use a fair amount of I mean–which really means that the “odd ball” group is the males-speaking-to-males. They’re the only group that’s constrained against using I mean (compared to what would’ve happened at chance).

In terms of genre, it’s in romances, comedies, and dramas that you get the most I mean‘s and in thrillers where you get the least.

Comparing the interlocutors’ positions in the credits, there’s not as much of a hierarchy thing happening with I mean. One thing that is strange is that while characters who are first in the credits use about as much I mean as we’d predict (based on their overall line counts and the overall percentage of lines-anyone-has with I mean), the characters in the second position are using A LOT of I mean.

When we look for interactions between credit-position and genre, we generally see that these characters do the same thing. That is, neither the 1st nor the 2nd person in the credits of a thriller is using much I mean. But both 1st and 2nd positions are using a lot of I mean in dramas.

They part ways in comedies and sci-fi. In comedies, the 1st position uses a lot of I mean, while the 2nd credited character uses very little. My sense is that I mean is a great resource in comedies for someone who has to explain themselves a lot and that’s the main protagonist in a comedy–they’re the ones who are put in spots that require clarification:

JUNO: My dad went through this phase where he was obsessed with Greek and Roman mythology. He named me after Zeus’s wife. I mean, Zeus had other lays, but I’m pretty sure Juno was his only wife. She was supposed to be really beautiful but really mean. Like Diana Ross.

In sci-fi, it’s reversed. The 1st-credited characters are very restricted from using I mean, while the 2nd-credited characters use it A LOT. This has something to do with explaining yourself again, but genre conventions are different. The hero of a sci-fi movie doesn’t do a lot of I mean’s.

You don’t get Ripley from Aliens talking about I mean (she has one example, but it’s That’s not what I mean). By contrast, “Newt”, the young girl who is the colony survivor (and who is #2 in the credits), says:

NEWT: Isn’t that how babies come? I mean people babies…they grow inside you?

Hm. I guess I’m going to stop now…with a really creepy line if you know anything about its context.

[Update 2/29/2012: After I published this article, I started to wonder whether the quote I gave for Newt really counts as a good example of discourse “I mean”. I think the findings are still true, but this super-cool example may not be as super-cool as I initially thought. What do you think?]

Tags: discourse, english, i mean, like, movie


Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx
Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]
African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]
COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]
What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
