Lao | Corpus linguistics

Tag Archives: Lao

October corpora from LDC

(This is mostly for Stanford folks)

We get periodic shipments of new corpora from the LDC. These are always available for you to check out as DVDs (just follow the steps for access here). We can also put them online, so you can ssh into the Stanford servers and go to /afs/ir/data/linguistic-data.

But there’s a catch. We have a limited amount of space there, so to add something we need to remove something. If any of these corpora, or any other corpora you know about, would be great to have online, send me a note.
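If you’re curious what is taking up the space before you ask for a swap, here is a minimal sketch that totals the size of each corpus directory. The root path is the one mentioned above; the one-subdirectory-per-corpus layout and the function names (`directory_size`, `corpus_sizes`) are my own assumptions for illustration, not a description of how the share is actually organized.

```python
import os

# Path quoted in the post; the one-subdirectory-per-corpus layout is an assumption.
CORPUS_ROOT = "/afs/ir/data/linguistic-data"


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so nothing is counted twice or followed off-tree.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def corpus_sizes(root=CORPUS_ROOT):
    """Return (name, bytes) pairs for each subdirectory of root, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, directory_size(entry.path)))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    if os.path.isdir(CORPUS_ROOT):
        for name, size in corpus_sizes():
            print(f"{size / 1e9:8.2f} GB  {name}")
    else:
        print(f"{CORPUS_ROOT} not found; run this on a machine with AFS access.")
```

Run it after you ssh in. Anything large near the top of the list that nobody is using is a good candidate to swap out.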

Spanish Gigaword–third edition

The great thing about this corpus is that it is enormous. Depending on your research project, you may or may not be as psyched about it being newswire text. It’s got everything the previous editions had, plus newer stuff, so it covers, roughly, the mid-1990s through Dec 31, 2010.

Arabic Gigaword–fifth edition

Same basic deal as the Spanish Gigaword–it covers news in Arabic from 2000 until Dec 2010. Here’s what the marked-up content looks like: http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2011T11.jpg.

2008 NIST Speaker Recognition Evaluation Test Set

This is actually nine DVDs’ worth of data because it’s 942 hours of telephone speech and interviews. The telephone speech is multilingual: predominantly English, but bilinguals were recruited, so in the telephone conversations you also get Arabic, Bengali, Chinese, Egyptian Arabic, Farsi, Hindi, Italian, Japanese, Korean, Lao, Punjabi, Russian, Tagalog, Tamil, Thai, Urdu, Uzbek, Vietnamese, Wu Chinese, and Yue Chinese. The interviews are just English.

You do get the transcripts, btw. The corpus was designed for speaker recognition, but there may be some really interesting code-switching stuff for people interested in bilingual data.



Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

Chat room corpus

Went hunting around for some chat room corpora today. I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]

African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity, Africa is rivaled only by New Guinea (which probably exceeds it, to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]

COCA: What a fantastic source of data!

Intro: 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has assembled a bunch of corpora and built an easy-to-use interface so you can make sophisticated queries on vast amounts […]

What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
