Archive | August, 2012

South Asian resources

28 Aug

Just a while ago, I mentioned a corpus of Malto, a Dravidian language of NE India and Bangladesh. But what about other South Asian resources? (Given the MILLIONS of people speaking many South Asian languages, there really ought to be a lot of resources…please suggest more for me to include here!)

  • Check out EMILLE for Bengali, Gujarati, Hindi, Punjabi, Urdu, Sinhalese, Tamil, Assamese, Kannada, Kashmiri, Malayalam, Marathi, Oriya, and Telugu.
  • ELRA-S0344, the LILA Hindi Belt database, covers 2,023 Hindi speakers (1,011 males and 1,012 females, all with Hindi as their first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.


Note that crowdsourcing on Amazon Mechanical Turk may be a great source of data for projects on Tamil, Malayalam, Hindi, Urdu, Punjabi, Marathi, Kutchi, Kannada, Telugu, and Gujarati. (See the chart on Rob Munro’s site here. If you want more about crowdsourcing reliability and best practices, check out my article with Victor Kuperman here, or the presentation version.)

Resources from the LDC:


Malto (a Dravidian language) corpus for dialect, folklore research and more!

21 Aug

Check out the new LDC corpus of Malto, which is a Dravidian language of India (I think ~200k speakers).

It’s 8 hours of recordings from 27 speakers (with glosses, transcriptions, etc. for 6 of the 8 hours).

This is one of those projects we should feel great about. Thanks to Masato Kobayashi and Bablu Tirkey (and the LDC) for making the data available. There’s not a lot of literature on Malto. If you’re interested in some facts, check out info from the World Atlas of Language Structures (there’s not a lot there…it’s SOV, fwiw).

I’m gonna throw in some names I think are relevant (alternate names/dialects…some of these may be wrong, sorry): Malatri, Maler, Malti, Malto, Maltu, Sawriya Malto, Sahibganj, Godda, Hiranpur, Litipara (Chatgam), Kumar, Mad, Mal, Paharia, Pahariya.