The DGT-TM corpus takes sentences from 22 European languages and pairs each one with manually produced translations in the other 21 languages. There are about 3 million sentences per language.
It covers all pairs among these 22 languages:
- Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish
Yes, you got that right: it’s all European, but it’s not *just* Indo-European (Estonian, Finnish, and Hungarian are Uralic, and Maltese is Semitic).
The corpus is triple the size that it was in 2007.
The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)
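If you’re wondering how many language pairs “all pairs among these 22 languages” works out to, here is a quick sketch in Python. The variable names are mine, and it only does the counting from the language list above; it doesn’t read or download the corpus itself.

```python
from itertools import combinations

# The 22 languages listed above.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: English-French and French-English count once.
pairs = list(combinations(LANGUAGES, 2))

# Translation directions: each unordered pair can be used both ways.
directions = len(pairs) * 2

print(len(LANGUAGES))  # 22
print(len(pairs))      # 231
print(directions)      # 462
```

So that is 231 language pairs, or 462 if you care about translation direction.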
One of the hot topics in syntax dissertations in recent years has been ellipsis (I’m not kidding: when I was on a job search committee a few years ago, ellipsis was *everywhere*).
Bos and Spenader (2011) have an annotated corpus!
Mohammad Sadegh Rasooli has put together a rule-based verb inflector for Persian as part of preprocessing work for the Persian dependency treebank and some other projects. Find out more:
Elizabeth Traugott offers these suggestions for corpora on world Englishes:
Earlier I showed a neat resource that has grammatical and ungrammatical sentences from various linguistics papers over the years. But what if you want a whole bunch of English errors?
In that case, check out work that Robert Dale and Adam Kilgarriff are putting together: http://clt.mq.edu.au/research/projects/hoo/.
You might also look around for ESL corpora, for example:
- CK Jung’s lab has a one-million-word written corpus of Korean learners of English (“YELC”): http://www.uclouvain.be/en-cecl-lcworld.html.
- Izumi, Uchimoto, and Isahara (2004) have a Japanese Learner English Corpus, too (NICT JLE), but I think you have to email them.
BootCaT is meant to help folks build up their own corpora from the Internet. However, it uses the Bing API and may not be able to do so for much longer, so it may go down temporarily. Go get your corpus started now!
Note that Google also has an API that you can use (but they limit you to 100 queries a day), as do Blekko and EntireWeb.
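With a cap like 100 queries a day, it helps to plan your queries before you send any. As I understand it, BootCaT builds its search queries from random combinations of seed terms, so here is a minimal sketch of that general idea in Python. To be clear, this is not BootCaT’s code: the function `seed_queries`, its parameters, and the seed words are all made up for illustration, and it doesn’t call any search API.

```python
import random
from itertools import combinations


def seed_queries(seeds, tuple_size=3, max_queries=100, rng_seed=0):
    """Return up to max_queries query strings, each a combination of seed terms.

    Hypothetical helper: max_queries stands in for a daily API quota.
    """
    # Every distinct combination of tuple_size seed terms.
    all_tuples = list(combinations(sorted(set(seeds)), tuple_size))
    # Shuffle reproducibly, then keep only as many as the quota allows.
    rng = random.Random(rng_seed)
    rng.shuffle(all_tuples)
    return [" ".join(t) for t in all_tuples[:max_queries]]


# Illustrative seed terms only.
seeds = ["corpus", "treebank", "ellipsis", "syntax",
         "parser", "annotation", "lexicon", "morphology"]

queries = seed_queries(seeds)
print(len(queries))  # 56: eight seeds give 56 three-word combinations
print(queries[:3])
```

Eight seeds only give you 56 three-word queries, which fits under the limit; with more seeds the list gets cut off at `max_queries`, and you would save the rest for the next day.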
You may also want to check out this O’Reilly summary of data sources.