Tag Archives: english

Tweet parser and word clusters

22 Sep

Brendan O’Connor & Co. from CMU have updated their tweet parser and provided a bunch of other stuff, including a collection of 56 million English-language tweets.

They’ve also done some clustering work on the words. Some of their clusters make a lot of sense immediately:

  • haven’t havent shoulda would’ve should’ve hadn’t woulda could’ve coulda havnt shouldve wouldve must’ve musta couldve haven’t havn’t hadnt might’ve hvnt mustve shuda wuda shudda wudda shulda wulda mighta cudda have’nt wudve shudve hvent #glocalurban hadn’t haven`t mightve shlda haven´t culda should’ve wlda avnt would’ve hvn’t may’ve cudve shldve have’t could’ve

Others are intriguing. For example, I believe gaydar may be an actual body part, given its cluster:

  • body brain soul skin stomach throat belly tummy ego imagination gut liver jaw spine bladder handwriting scalp body’s subconscious uterus complexion stomache eyesight navel torso palate bodys demeanor physique waistline clitoris abdomen spleen gaydar gallbladder pocketbook bdy bodyy tummy’s tailbone ringback ribcage cervix skinn throat’s escentuals skin’s sternum ellum cell’s

Btw, look at all the ways to put lol in the past tense!

  • looked felt laughed yelled tasted screamed smiled smelled acted shouted stared waved lol’d smelt bitched giggled winked loled lookd behaves glanced chuckled honked barked moaned growled peeked blushed beeped lol’ed squealed gasped hollered cringed whistled whined glared lold grinned smirked hissed snored lolled holla’d lol-ed laffed meowed stuttered groaned flinched

The clusters in HTML: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html


Power (Supreme Court Justices and Wikipedia editors)!

19 Sep

Cristian Danescu-Niculescu-Mizil, Timothy Hawes, and colleagues have released some more corpora that are worth playing with.

  • The Wikipedia Talk Page Conversations Corpus: 125,000 conversations involving about 30,000 editors. Metadata such as editor’s status, time of status change and gender is included.
  • Supreme Court Dialogs Corpus: oral arguments making up 51,498 utterances (50,389 conversational exchanges); 204 cases with 11 justices and 311 other participants (lawyers, for example). You get case outcome, justice vote, gender annotation, etc. 

These corpora support really interesting work about how accommodation and power go together.

One of my interests is how the word little gets used in terms of power relationships (see also my dissertation). I took a quick look through the Supreme Court corpus. (Very, very quick, so I draw no conclusions here and just report some numbers.)

Justices Breyer and Ginsburg both talk a lot during oral arguments (their speech represents 15.51% and 11.37%, respectively, of all the justices’ speech in the corpus). There are 232 turns involving the word “little”. That’s not quite as many as I’d like for making really strong claims. But these two justices use the word at very different rates: Breyer uses it 71 times, nearly twice as often as we’d expect if little were distributed across the justices based on their total number of turns. Ginsburg, on the other hand, uses it only 10 times, which is 38% of what we might have expected.
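
For anyone who wants to check the arithmetic, here is a rough sketch of how those observed-versus-expected ratios come out. It assumes that each justice's share of justice speech (15.51% and 11.37%) can stand in for their share of turns, and that the 232 turns are all justice turns; the function name `observed_to_expected` is made up for illustration and is not part of the corpus tools.

```python
# Rough check of the "little" numbers reported above.
# Assumption (not stated in the corpus): a justice's share of all
# justice speech approximates their share of turns, so the expected
# count is simply total * share.

TOTAL_LITTLE_TURNS = 232  # turns containing "little" (from the post)


def observed_to_expected(observed, share, total=TOTAL_LITTLE_TURNS):
    """Return (expected count, observed/expected ratio) for one speaker."""
    expected = total * share
    return expected, observed / expected


# (observed "little" turns, share of justice speech), both from the post
justices = {
    "Breyer": (71, 0.1551),
    "Ginsburg": (10, 0.1137),
}

for name, (observed, share) in justices.items():
    expected, ratio = observed_to_expected(observed, share)
    print(f"{name}: expected {expected:.1f}, observed {observed}, ratio {ratio:.2f}")
```

Under those assumptions the ratio is about 1.97 for Breyer and about 0.38 for Ginsburg, which matches the "nearly twice" and "38%" figures.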

I’m not going to offer any analysis here, but I do want to give some examples of the work that little does–it is sometimes dismissive, sometimes hedging (it is especially likely to hedge states that the justices are claiming for themselves, it seems). Here are a few examples from Breyer:

  • “they have a little paragraph of explanation”
  • “now, that’s a little tough”
  • “you put a little thing in the corner”
  • “I’m a little nervous about it”

Other justices talk about being a little puzzled, confused, etc. But notice that Ginsburg never pairs little with any kind of mental state. The very closest she comes is something pretty far off (“I” is not in the utterance): “It’s — given that we’re dealing with sophisticated judges, the same panel in both episodes, it’s a little hard to — to see where the due process violation is.”

Btw, the string “I mean,” is used 2,020 times. Justices and non-justices speak for roughly the same amount of time, but the justices are the ones who use “I mean,” much more (in terms of utterances that have the phrase in them, justices use it 1.4 times more often than we’d expect them to, and 2.6 times more often than the non-justices who talk). Wanna know who is the biggest “I mean,”‘er? Have a guess?

Super-European language translation corpus

15 Apr

The DGT-TM corpus takes sentences from 22 European languages and has manually produced translations into the 21 other languages. There are about 3 million sentences per language.

With all pairs among these 22 languages:

  • Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish

Yes, you got that right: it’s all European, but it’s not *just* Indo-European.
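
To get a feel for what "all pairs" means with 22 languages, here is a small sketch that counts them. The language list is copied from above; the grouping of the non-Indo-European languages is my own addition (Estonian, Finnish, and Hungarian are Uralic, Maltese is Semitic) and is not taken from the corpus documentation.

```python
from itertools import combinations

# The 22 languages listed above.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Added for illustration, not from the corpus documentation:
# the languages in the list that are not Indo-European.
NON_INDO_EUROPEAN = {"Estonian", "Finnish", "Hungarian", "Maltese"}

# Unordered pairs (English-French counts once).
pairs = list(combinations(LANGUAGES, 2))

# Directed pairs (English->French and French->English count separately).
directed_pair_count = len(LANGUAGES) * (len(LANGUAGES) - 1)

# Pairs where at least one side is not Indo-European.
mixed_pairs = [p for p in pairs if NON_INDO_EUROPEAN & set(p)]

print(len(LANGUAGES), "languages")
print(len(pairs), "unordered pairs")
print(directed_pair_count, "directed pairs")
print(len(mixed_pairs), "pairs with a non-Indo-European language")
```

That works out to 231 unordered pairs, or 462 if translation direction matters.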

It’s triple the size that it was in 2007.

The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)

Verb phrase ellipsis corpus

14 Apr

One of the hot topics in syntax dissertations in recent years has been ellipsis (I’m not kidding–when I was on a job search committee a few years ago, ellipsis was *everywhere*).

Bos and Spenader (2011) have an annotated corpus!


11 Apr

Stephanie Shih has been doing really fun work on what makes a name (first and last) using a corpus of Facebook names. This helps her get recent trends–the Social Security Administration releases first names all the time, but it doesn’t release first+last until 100 years after the birth certificates come in.

Amaç Herdagdelen has compiled Census data from 1990 and put it with the Social Security Administration’s statistics for popular baby names for every year between 1960 and 2010.

World Englishes

9 Apr

Elizabeth Traugott offers these suggestions for corpora on world Englishes:

Erorrs erorrs evrerywehere

7 Apr

Earlier I showed a neat resource that has grammatical and ungrammatical sentences from various linguistics papers over the years. But what if you want a whole bunch of English errors?

In that case, check out work that Robert Dale and Adam Kilgarriff are putting together: http://clt.mq.edu.au/research/projects/hoo/.

You might also look around for ESL corpora, for example:

  • CK Jung’s lab has a one-million-word written corpus of Korean learners of English (“YELC”): http://www.uclouvain.be/en-cecl-lcworld.html.
  • Izumi, Uchimoto, and Isahara (2004) have a Japanese Learner English Corpus, too (NICT JLE), but I think you have to email them.