28 Aug

Just a while ago, I mentioned a corpus of Malto, a Dravidian language of NE India and Bangladesh. But what about other South Asian resources? (Given the MILLIONS of people speaking many South Asian languages, there really ought to be a lot of resources…please suggest more for me to include here!)

  • Check out EMILLE for Bengali, Gujarati, Hindi, Punjabi, Urdu, Sinhalese, Tamil, Assamese, Kannada, Kashmiri, Malayalam, Marathi, Oriya, and Telugu.
  • ELRA-S0344 LILA Hindi Belt database of 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.


Note that crowdsourcing on Amazon Mechanical Turk may be a great source of data for projects on Tamil, Malayalam, Hindi, Urdu, Punjabi, Marathi, Kutchi, Kannada, Telugu, and Gujarati. (See the chart on Rob Munro’s site here. If you want more about crowdsourcing reliability and best practices, check out my article with Victor Kuperman here, or the presentation version.)

Resources from the LDC:

Ho ho ho, December’s new LDC corpora

6 Dec

December has brought us 18 DVDs worth of data.

Chinese Gigaword Fifth Edition (1 DVD)

Known to some of you as LDC2011T13, this is Mandarin Chinese newswire stuff. Here’s what the data looks like. If you’re working on Chinese, you probably want this.

2006 NIST Speaker Recognition Evaluation Training Set (7 DVDs)

“Honey, it’s your mother.” If you don’t recognize that voice, try developing a better algorithm on this training set: LDC2011S09. This is telephone speech mostly in English, but also in Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai, Urdu, and Yue Chinese. It’s 595 hours and there are English transcripts for the non-English parts.

You don’t have to be interested only in speaker identification to use this–work on ESL, code-switching, and discourse would all make sense with the data here.

2006 NIST/USF Evaluation Resources for the VACE Program – Meeting Data Test Set Part 2 (10 DVDs)

In the catalog, this is called LDC2011V06 and I think you should probably follow that link. But basically you get 20 hours of meetings, held in research institutions in Pennsylvania, Virginia, Maryland, Scotland, Switzerland, and the Netherlands. (But all in English, I believe.)

For us linguists who normally work with just audio or text, this is a very rich video database. The VACE program’s goal was to extract video content automatically and to understand events. So there’s tracking of faces, hands, people, vehicles, and text. In other VACE corpora, you can get other meetings as well as broadcast news, street surveillance, and unmanned aerial vehicle motion imagery. Uh, okay, so if you’re a linguist looking at unmanned aerial vehicle motion imagery, you should send me a note to tell me more. But for the rest of us, this meeting data shows group dynamics that could go in any number of directions.

October corpora from LDC

3 Nov

(This is mostly for Stanford folks)

We get periodic shipments of new corpora from the LDC. These are always available for you to check out as DVDs (just follow steps for access here). We can also put these online so you can ssh into the Stanford servers and go to /afs/ir/data/linguistic-data.

But there’s a catch. We have a limited amount of space there–so to add something, we need to remove something. If any of these corpora–or any other corpora you know about–would be great to have online, send me a note.

Spanish Gigaword–third edition

The great thing about this corpus is that it is enormous. Depending upon your research project, you may or may not be as psyched about it being newswire text. It’s got everything the previous editions had, plus newer stuff–so it covers, roughly, the mid-1990s until Dec 31, 2010.

Arabic Gigaword–fifth edition

Same basic deal as the Spanish Gigaword–it covers news in Arabic from 2000 until Dec 2010. Here’s what the marked-up content looks like: http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2011T11.jpg.

2008 NIST Speaker Recognition Evaluation Test Set

This is actually nine DVDs’ worth of data because it’s 942 hours of telephone speech and interviews. The telephone speech is multilingual–predominantly English, but bilinguals were recruited, so in the telephone conversations you also get Arabic, Bengali, Chinese, Egyptian Arabic, Farsi, Hindi, Italian, Japanese, Korean, Lao, Punjabi, Russian, Tagalog, Tamil, Thai, Urdu, Uzbek, Vietnamese, Wu Chinese, and Yue Chinese. The interviews are just English.

You do get the transcripts, btw. The corpus was imagined to be for speaker recognition, but there may be some really interesting code-switching stuff for people interested in bilingual data.


