Just a while ago, I mentioned a corpus of Malto, a Dravidian language of NE India and Bangladesh. But what about other South Asian resources? (Given the MILLIONS of people speaking many South Asian languages, there really ought to be a lot of resources…please suggest more for me to include here!)
- Check out EMILLE for Bengali, Gujarati, Hindi, Punjabi, Urdu, Sinhala, Tamil, Assamese, Kannada, Kashmiri, Malayalam, Marathi, Oriya, and Telugu.
- ELRA-S0344, the LILA Hindi Belt database, covers 2,023 Hindi speakers (1,011 males and 1,012 females, all with Hindi as their first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.
Note that crowdsourcing on Amazon Mechanical Turk may be a great source of data for projects on Tamil, Malayalam, Hindi, Urdu, Punjabi, Marathi, Kutchi, Kannada, Telugu, and Gujarati. (See the chart on Rob Munro’s site here. If you want more about crowdsourcing reliability and best practices, check out my article with Victor Kuperman here or the presentation version.)
- Osborne and colleagues used Mechanical Turk to get 2k sentences translated into Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu.
Resources from the LDC:
- Urdu news
- Hindi WordNet
- Tamil-English super-dictionary
- Urdu read speech
- Part of speech tagging for Bengali, Hindi, Sanskrit