December | 2011 | Corpus linguistics

Archive | December, 2011

Ho ho ho, December’s new LDC corpora

December has brought us 18 DVDs' worth of data.

Chinese Gigaword Fifth Edition (1 DVD)

Known to some of you as LDC2011T13, this is Mandarin Chinese newswire stuff. Here’s what the data looks like. If you’re working on Chinese, you probably want this.

2006 NIST Speaker Recognition Evaluation Training Set (7 DVDs)

“Honey, it’s your mother.” If you don’t recognize that voice, try developing a better algorithm on this training set: LDC2011S09. This is telephone speech mostly in English, but also in Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai, Urdu, and Yue Chinese. It’s 595 hours and there are English transcripts for the non-English parts.

You don’t have to be interested in only speaker identification to use this–ESL stuff, code-switching, and discourse studies would all make sense for the data here.

2006 NIST/USF Evaluation Resources for the VACE Program – Meeting Data Test Set Part 2 (10 DVDs)

In the catalog, this is called LDC2011V06 and I think you should probably follow that link. But basically you get 20 hours of meetings (held in research institutions in Pennsylvania, Virginia, Maryland, Scotland, Switzerland, and the Netherlands, but all in English, I believe).

For us linguists who normally work with just audio or text, this is a very rich video database. The VACE program’s goal was to extract video content automatically and to understand events. So there’s tracking of faces, hands, people, vehicles, and text. In other VACE corpora, you can get other meetings as well as broadcast news, street surveillance, and unmanned aerial vehicle motion imagery. Uh, okay, so if you’re a linguist looking at unmanned aerial vehicle motion imagery, you should send me a note to tell me more. But for the rest of us, this meeting data shows group dynamics that could go in any number of directions.

Tags: Arabic, Bengali, Chinese, computational, english, Hindi, Korean, ldc, Mandarin, newswire, Russian, Thai, Urdu, voice recognition, Yue


Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx
Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English-language newsgroups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]
African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]
COCA: What a fantastic source of data!

Intro: 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has assembled a bunch of corpora and built an easy-to-use interface so you can make sophisticated queries on vast amounts […]
What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
