Tag Archives: Spanish

Super-European language translation corpus

15 Apr

The DGT-TM corpus takes sentences from 22 European languages and provides manually produced translations into the other 21. That works out to roughly 3 million sentences per language.

All pairs among these 22 languages are covered:

  • Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish
Yes, you got that right: it’s all European, but it’s not *just* Indo-European. (Hungarian, Finnish, and Estonian are Uralic, and Maltese is Semitic.)

It’s triple the size it was in 2007.

The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)
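For scale: translations run among all 22 languages, so a quick back-of-the-envelope count of language pairs (just itertools, nothing DGT-TM-specific):

```python
from itertools import combinations

languages = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: 22 choose 2
pairs = list(combinations(languages, 2))
print(len(pairs))  # 231 unordered pairs (462 if you count both directions)
```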


October corpora from LDC

3 Nov

(This is mostly for Stanford folks)

We get periodic shipments of new corpora from the LDC. These are always available for you to check out as DVDs (just follow the steps for access here). We can also put these online so you can ssh into the Stanford servers and go to /afs/ir/data/linguistic-data.

But there’s a catch. We have a limited amount of space there–so to add something, we need to remove something. If any of these corpora–or any other corpora you know about–would be great to have online, send me a note.

Spanish Gigaword–third edition

The great thing about this corpus is that it is enormous. Depending on your research project, you may or may not be as psyched that it’s newswire text. It’s got everything the previous editions had, plus newer material–so it covers, roughly, the mid-1990s through Dec 31, 2010.

Arabic Gigaword–fifth edition

Same basic deal as the Spanish Gigaword–it covers news in Arabic from 2000 until Dec 2010. Here’s what the marked-up content looks like: http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2011T11.jpg.

2008 NIST Speaker Recognition Evaluation Test Set

This is actually nine DVDs’ worth of data because it’s 942 hours of telephone speech and interviews. The telephone speech is multilingual–predominantly English, but bilinguals were recruited, so in the telephone conversations you also get Arabic, Bengali, Chinese, Egyptian Arabic, Farsi, Hindi, Italian, Japanese, Korean, Lao, Punjabi, Russian, Tagalog, Tamil, Thai, Urdu, Uzbek, Vietnamese, Wu Chinese, and Yue Chinese. The interviews are English only.

You do get the transcripts, btw. The corpus was designed for speaker recognition, but there may be some really interesting code-switching material for people interested in bilingual data.

Top five LDC corpora

30 Oct

In this post, I’d like to start reviewing some of the most popular corpora that the Linguistic Data Consortium provides–with a few possible alternatives. If you have a favorite corpus, send it in!

1. TIMIT Acoustic-Phonetic Continuous Speech Corpus

If you’re interested in speech recognition, here’s one of your main resources. It’s basically 630 speakers (covering 8 American English dialect regions) each reading 10 “phonetically rich” sentences. The recordings are time-aligned with orthographic and phonetic transcripts, everything has been hand-verified, and it comes pre-split into training/test subsets.
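The time alignments, by the way, are plain-text files, one segment per line as start sample, end sample, label (at TIMIT’s 16 kHz sampling rate, divide by 16000 for seconds). A minimal parser sketch, assuming that layout–the example segment values here are made up:

```python
def parse_timit_alignment(text, sample_rate=16000):
    """Parse a TIMIT-style alignment file body ('start end label' per line).
    Returns a list of (start_sec, end_sec, label) tuples."""
    segments = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        segments.append((int(start) / sample_rate, int(end) / sample_rate, label))
    return segments

# Made-up snippet in the alignment layout:
example = "0 3050 h#\n3050 4559 sh\n4559 5723 iy"
for start, end, phone in parse_timit_alignment(example):
    print(f"{phone}: {start:.3f}-{end:.3f}s")
```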

2. Web 1T 5-gram Version 1

This is basically Google n-gram stuff for English (unigrams to 5-grams). So if you want collocates and word frequencies, this is pretty good. There are 1 trillion word tokens, after all.

  • 95 billion sentences
  • 13 million unigrams
  • 1 billion 5-grams

This data was released in 2006, though, so there should be more up-to-date resources.
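The count files themselves are plain text, one n-gram per line with a tab-separated count (“w1 w2 … wn<TAB>count”). Getting relative frequencies out of them takes only a few lines–a sketch, with made-up counts:

```python
def read_ngram_counts(lines):
    """Parse 'w1 w2 ... wn<TAB>count' lines into a {tuple_of_words: count} dict."""
    counts = {}
    for line in lines:
        ngram, count = line.rstrip("\n").split("\t")
        counts[tuple(ngram.split())] = int(count)
    return counts

# Made-up example lines in the tab-separated layout:
lines = ["the quick\t157\n", "quick brown\t42\n"]
counts = read_ngram_counts(lines)
total = sum(counts.values())
print(counts[("quick", "brown")] / total)  # relative frequency among these bigrams
```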

There’s also a 2010 (Mandarin) Chinese 5-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

A 2009 Japanese 7-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T08

And a 2009 “European” 5-gram corpus covering Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, and Swedish: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25

3. CELEX2 (but why not try SUBTLEX?)

This corpus, circa 1996, gives you ASCII versions of three lexical databases for English, Dutch, and German. You get:

  • orthography variations
  • phonological stuff like syllables and stress
  • morphology
  • word class, argument structures
  • word frequency, lemma frequency (based on “recent and representative text corpora”)

In truth, if you just want word counts for American English, then consider using SUBTLEXus: http://subtlexus.lexique.org/. They make the case that CELEX is actually a poor source of frequency information (I’ll let you follow the link for their arguments against it and against Kucera and Francis). Actually, if you go ahead and check out http://elexicon.wustl.edu/, you can download words (and non-words) with reaction times and all the morphology/phonology/syntax information that CELEX2 gives you.
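One note on frequency norms generally: they’re usually compared on a per-million-words scale, which is just the raw count divided by corpus size in millions. A sketch–the ~51-million-token figure is SUBTLEXus’s own corpus size, and the count is made up:

```python
def per_million(raw_count, corpus_tokens):
    """Convert a raw frequency count to occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_tokens

# Made-up count of 102 occurrences in a ~51-million-token corpus:
print(per_million(102, 51_000_000))  # 2.0 occurrences per million
```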


4. TIDIGITS

Okay, I had never heard of this one. The main use for this corpus is speech recognition–for digits. You get 111 men, 114 women, 50 boys, and 51 girls, each pronouncing 77 different digit sequences, recorded in 1982.

5. ECI Multilingual Text

So the European Corpus Initiative Multilingual Corpus 1 (ECI/MCI) has 46 subcorpora totaling 92 million words (marked up, but you can get the unmarked versions, too).

Twelve of the component corpora come with parallel translations in 2–9 other languages.

Most of the material is journalistic, plus some dictionaries, literature, and international organization publications/proceedings/reports. It seems to come mostly from the 1980s and early 1990s.

Anyone have a favorite corpus of UN delegates talking and being translated into a bunch of different languages?

Languages available: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish