Archive | December, 2011

Ho ho ho, December’s new LDC corpora

6 Dec

December has brought us 18 DVDs’ worth of data.

Chinese Gigaword Fifth Edition (1 DVD)

Known to some of you as LDC2011T13, this is Mandarin Chinese newswire stuff. Here’s what the data looks like. If you’re working on Chinese, you probably want this.

2006 NIST Speaker Recognition Evaluation Training Set (7 DVDs)

“Honey, it’s your mother.” If you don’t recognize that voice, try developing a better algorithm on this training set: LDC2011S09. This is telephone speech mostly in English, but also in Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai, Urdu, and Yue Chinese. It’s 595 hours and there are English transcripts for the non-English parts.

You don’t have to be interested only in speaker identification to use this: ESL stuff, code-switching, and discourse studies would all make sense for the data here.

2006 NIST/USF Evaluation Resources for the VACE Program – Meeting Data Test Set Part 2 (10 DVDs)

In the catalog, this is called LDC2011V06 and I think you should probably follow that link. But basically you get 20 hours of meetings, held in research institutions in Pennsylvania, Virginia, Maryland, Scotland, Switzerland, and the Netherlands. (But all in English, I believe.)

For us linguists who normally work with just audio or text, this is a very rich video database. The VACE program’s goal was to extract video content automatically and to understand events. So there’s tracking of faces, hands, people, vehicles, and text. In other VACE corpora, you can get other meetings as well as broadcast news, street surveillance, and unmanned aerial vehicle motion imagery. Uh, okay, so if you’re a linguist looking at unmanned aerial vehicle motion imagery, you should send me a note to tell me more. But for the rest of us, this meeting data shows group dynamics that could go in any number of directions.