Archive | April, 2012

Super-European language translation corpus

15 Apr

The DGT-TM corpus takes sentences from 22 European languages and provides manually produced translations of each into the other 21. There are about 3 million sentences per language.

It covers all pairs among these 22 languages:

  • Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish

Yes, you got that right: it’s all European, but it’s not *just* Indo-European.
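
To get a feel for what "all pairs" means here, this is a quick sketch that counts the pairs among the 22 languages listed above. The `NON_INDO_EUROPEAN` set is my own annotation (Estonian, Finnish, and Hungarian are Uralic; Maltese is Semitic), not something the corpus itself labels, and the variable names are just for illustration.

```python
from itertools import combinations

# The 22 languages named in the post.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Assumption: my own annotation of which of these are not Indo-European.
NON_INDO_EUROPEAN = {"Estonian", "Finnish", "Hungarian", "Maltese"}

# Unordered pairs: each pair of languages is counted once.
pairs = list(combinations(LANGUAGES, 2))

# Directed pairs: source -> target is counted separately from target -> source.
directions = len(pairs) * 2

# Pairs where at least one language is outside Indo-European.
mixed_pairs = [p for p in pairs if NON_INDO_EUROPEAN & set(p)]

print(len(LANGUAGES), "languages")
print(len(pairs), "language pairs")
print(directions, "translation directions")
print(len(mixed_pairs), "pairs involving a non-Indo-European language")
```

So 22 languages give 231 language pairs, or 462 translation directions if you count each direction separately.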

It’s triple the size that it was in 2007.

The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)


Verb phrase ellipsis corpus

14 Apr

One of the hot topics in syntax dissertations in recent years has been ellipsis (I’m not kidding–when I was on a job search committee a few years ago, ellipsis was *everywhere*).

Bos and Spenader (2011) have an annotated corpus!


11 Apr

Stephanie Shih has been doing really fun work on what makes a name (first and last) using a corpus of Facebook names. This helps her get recent trends–the Social Security Administration releases first names all the time, but it doesn’t release first+last until 100 years after the birth certificates come in.

Amaç Herdagdelen has compiled Census data from 1990 and combined it with the Social Security Administration’s statistics for popular baby names for every year between 1960 and 2010:

Persian verb inflections

10 Apr

Mohammad Sadegh Rasooli has put together a rule-based verb inflector for Persian as part of preprocessing work for the Persian dependency treebank, among other projects. Find out more:

World Englishes

9 Apr

Elizabeth Traugott offers these suggestions for corpora on world Englishes:

Erorrs erorrs evrerywehere

7 Apr

Earlier I showed a neat resource that has grammatical and ungrammatical sentences from various linguistics papers over the years. But what if you want a whole bunch of English errors?

In that case, check out work that David Hale and Adam Kilgarriff are putting together:

You might also look around for ESL corpora, for example:

  • CK Jung’s lab has a one-million-word written corpus of Korean learners of English (“YELC”):
  • Izumi, Uchimoto, and Isahara (2004) have a Japanese Learner English Corpus, too (NICT JLE), but I think you have to email them.

Build your own corpus (well, for now)

1 Apr

BootCaT is meant to help folks build up their own corpora from the Internet. However, it uses the Bing API and may not be able to do so for much longer, so it may go down temporarily. Go get your corpus started now!

Note that Google also has an API that you can use (but they limit you to 100 queries a day), as do Blekko and EntireWeb.
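
If you do end up working under a cap like that, here is a rough sketch of how you might spread a list of search queries over several days. It assumes the limit is a flat 100 queries per day, as mentioned above; the function name and the placeholder query strings are made up for illustration, and nothing here calls a real search API.

```python
def split_into_daily_batches(queries, daily_limit=100):
    """Split queries into consecutive batches of at most daily_limit items."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    return [
        queries[start:start + daily_limit]
        for start in range(0, len(queries), daily_limit)
    ]

# Hypothetical queries, only here to show the arithmetic.
queries = [f"query {n}" for n in range(250)]

batches = split_into_daily_batches(queries)
print(len(batches), "days needed")
print([len(batch) for batch in batches])
```

With 250 queries and a limit of 100 a day, that works out to three days: two full batches and one of 50.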

You may also want to check out this O’Reilly summary of data sources.