The DGT-TM corpus takes sentences from 22 European languages and has translations (manually produced) for 21 other languages. That means there are about 3 million sentences per language.
With all pairs among these 22 languages:
- Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish
It’s triple the size that it was in 2007.
The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)