Archive | September, 2012

Tweet parser and word clusters

22 Sep

Brendan O’Connor & Co. from CMU have updated their tweet parser and provided a bunch of other stuff, including a collection of 56 million English-language tweets.

They’ve also done some clustering work on the words. Some of their clusters make a lot of sense immediately:

  • haven’t havent shoulda would’ve should’ve hadn’t woulda could’ve coulda havnt shouldve wouldve must’ve musta couldve haven’t havn’t hadnt might’ve hvnt mustve shuda wuda shudda wudda shulda wulda mighta cudda have’nt wudve shudve hvent #glocalurban hadn’t haven`t mightve shlda haven´t culda should’ve wlda avnt would’ve hvn’t may’ve cudve shldve have’t could’ve

Others are intriguing. For example, I believe gaydar may be an actual body part, given its cluster:

  • body brain soul skin stomach throat belly tummy ego imagination gut liver jaw spine bladder handwriting scalp body’s subconscious uterus complexion stomache eyesight navel torso palate bodys demeanor physique waistline clitoris abdomen spleen gaydar gallbladder pocketbook bdy bodyy tummy’s tailbone ringback ribcage cervix skinn throat’s escentuals skin’s sternum ellum cell’s

Btw, look at all the ways to put lol in the past tense!

  • looked felt laughed yelled tasted screamed smiled smelled acted shouted stared waved lol’d smelt bitched giggled winked loled lookd behaves glanced chuckled honked barked moaned growled peeked blushed beeped lol’ed squealed gasped hollered cringed whistled whined glared lold grinned smirked hissed snored lolled holla’d lol-ed laffed meowed stuttered groaned flinched

The clusters in HTML:


Aboriginal Australian language support

21 Sep

Here’s a table Piers Kelley put together for the R-N-L-D mailing list. It has some handy resources for people interested in corpora as well as language documentation/preservation/teaching/learning.

Site | Service | Main audience
Aboriginal Education, NSW Board of Studies | Teaching resources for NSW languages, with useful material for developing teaching programs in other parts of Australia | Teachers, language activists, speakers
[site name missing] | Austlang, Aseda, Ozbib, Language and People Thesaurus, etc. | Speakers, linguists
AuSil | SIL’s site for Australian languages. Downloadable dictionaries, grammatical descriptions, mobile phone apps. | Linguists, speakers
Australian-languages | Mailing list for individuals and community groups who support the future of Australian languages | Speakers, linguists, language activists
Australian-linguistics | Mailing list for discussing issues relevant to linguists working on Australian languages | Linguists
Batchelor Institute’s CALL collection | Digitised materials on NT languages for enrolled students | Students at Batchelor
CDU Yolngu studies | Class resources for studying Yolngu at CDU | Learners
David Nash’s site | Bibliographies for research on Australian languages, particularly of central Australia | Linguists
David Nathan’s ‘Aboriginal Languages of Australia’ | Links to web resources for Australian languages including newspaper articles | Public, linguists, speakers
Endangered languages | World wide catalogue of endangered languages | Public
Endangered languages and cultures blog | Useful commentary on Australian (and other) languages | Linguists
[site name missing] | Language-specific social networking groups | Speakers
Handbook of WA languages | An annotated bibliography and guide to the indigenous languages of part of Western Australia | Linguists
Language centre websites | Information about Australian languages at a regional level. Some have online dictionaries and other resources. | Speakers, public
Living Archive of Aboriginal languages | Database of materials produced by NT bilingual school programs | Teachers
[site name missing] | Software for language documentation | Speaker-linguists, linguists
Mogwi Dahn | Email list and resources for speakers wanting to work professionally with their languages | Speaker-linguists, language activists
Ngapartji Ngapartji | For learning Pitjantjatjara | Learners
Our Languages | Broad information on languages, language in the news, events, cultural protocols etc | Speakers, public
[site name missing] | Archived email list for issues in language endangerment and technical questions about language documentation; links, news etc | Linguists, language activists
Wikipedia | Numerous detailed entries on Australian languages | Public, linguists, speakers

Power (Supreme Court Justices and Wikipedia editors)!

19 Sep

Cristian Danescu-Niculescu-Mizil, Timothy Hawes, and colleagues have released some more corpora that are worth playing with.

  • The Wikipedia Talk Page Conversations Corpus: 125,000 conversations involving about 30,000 editors. Metadata such as editor’s status, time of status change and gender is included.
  • Supreme Court Dialogs Corpus: oral arguments making up 51,498 utterances (50,389 conversational exchanges); 204 cases with 11 justices and 311 other participants (lawyers, for example). You get case outcome, justice vote, gender annotation, etc. 

These corpora support really interesting work about how accommodation and power go together.

One of my interests is how the word little gets used in terms of power relationships (see also my dissertation). I took a quick look through the Supreme Court corpus. (Very, very quick, so I draw no conclusions here and just report some numbers.)

Justices Breyer and Ginsburg both talk a lot during oral arguments (their speech represents 15.51% and 11.37%, respectively, of all the justices’ speech in the corpus). There are 232 turns involving the word “little”. That’s not quite as many as I’d like for making really strong claims. But these two justices use the word at very different rates: Breyer uses it 71 times–nearly twice as often as we’d expect if little were distributed across the justices based on their total number of turns. Ginsburg, on the other hand, uses it only 10 times, which is 38% of what we might have expected.
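
For anyone who wants to check the arithmetic, here is a rough sketch of the observed-versus-expected comparison. It assumes that each justice's share of speech (15.51% and 11.37%) can stand in for their expected share of the 232 "little" turns. That is a simplification made for this sketch, and the function and variable names are made up for illustration.

```python
# Figures quoted in the post. Using share of speech as the expected share
# of "little" turns is an assumption of this sketch, not a claim about
# how the original numbers were computed.
TOTAL_LITTLE_TURNS = 232

justices = {
    "Breyer": {"share": 0.1551, "observed": 71},
    "Ginsburg": {"share": 0.1137, "observed": 10},
}


def observed_to_expected(observed, share, total=TOTAL_LITTLE_TURNS):
    """Return (expected count, observed/expected ratio) for one speaker."""
    expected = share * total
    return expected, observed / expected


for name, row in justices.items():
    expected, ratio = observed_to_expected(row["observed"], row["share"])
    print(f"{name}: expected {expected:.1f}, observed {row['observed']}, ratio {ratio:.2f}")
# Breyer: expected 36.0, observed 71, ratio 1.97
# Ginsburg: expected 26.4, observed 10, ratio 0.38
```

Under that assumption the ratios come out at about 1.97 for Breyer and 0.38 for Ginsburg, which lines up with "nearly twice" and "38%".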

I’m not going to offer any analysis here, but I do want to give some examples of the work that little does–it is sometimes dismissive, sometimes hedging (it is especially likely to hedge states that the justices are claiming for themselves, it seems). Here are a few examples from Breyer:

  • “they have a little paragraph of explanation”
  • “now, that’s a little tough”
  • “you put a little thing in the corner”
  • “I’m a little nervous about it”

Other justices talk about being a little puzzled, confused, etc. But notice that Ginsburg never pairs little with any kind of mental state. The very closest she comes is something pretty far off (“I” is not in the utterance): “It’s — given that we’re dealing with sophisticated judges, the same panel in both episodes, it’s a little hard to — to see where the due process violation is.”

Btw, the string “I mean,” is used 2,020 times. Justices and non-justices speak for roughly the same amount of time, but the justices are the ones using “I mean,” much more (in terms of utterances that contain the phrase, justices use it 1.4 times more often than we’d expect them to, and 2.6 times more often than the non-justices who talk). Wanna know who the biggest “I mean,”-er is? Have a guess?
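
If you want to try this kind of count yourself, here is a minimal sketch of how the two comparisons (versus expectation, and versus the other group) could be computed. The utterances below are invented toy data, not lines from the corpus, and `phrase_rates` is a hypothetical helper. It also assumes the expectation comes from each group's share of utterances, which may not match how the numbers above were derived.

```python
from collections import Counter


def phrase_rates(utterances, phrase="i mean,"):
    """Count utterances containing `phrase` per role and compare to expectation.

    `utterances` is an iterable of (role, text) pairs. The expected count for
    a role is its share of all utterances times the total number of hits.
    """
    totals, hits = Counter(), Counter()
    for role, text in utterances:
        totals[role] += 1
        if phrase in text.lower():
            hits[role] += 1

    all_total = sum(totals.values())
    all_hits = sum(hits.values())
    result = {}
    for role in totals:
        expected = all_hits * totals[role] / all_total
        result[role] = {
            "hits": hits[role],
            "rate": hits[role] / totals[role],
            "vs_expected": hits[role] / expected if expected else float("nan"),
        }
    return result


# Invented toy utterances, for illustration only.
toy = [
    ("justice", "I mean, that's the question."),
    ("justice", "Well, I mean, look at the statute."),
    ("justice", "Thank you, counsel."),
    ("justice", "Why?"),
    ("other", "I mean, yes, Your Honor."),
    ("other", "That is correct."),
    ("other", "No."),
    ("other", "The statute says so."),
]

rates = phrase_rates(toy)
print(rates["justice"]["vs_expected"])                  # observed / expected
print(rates["justice"]["rate"] / rates["other"]["rate"])  # justices vs others
```

Swapping in the real (speaker role, utterance text) pairs from the corpus files should give the actual figures.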