April | 2012 | Corpus linguistics

Super-European language translation corpus

The DGT-TM corpus takes sentences in each of 22 European languages and provides manually produced translations into the other 21. There are about 3 million sentences per language.

With all pairs among these 22 languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish
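
As a quick check on what "all pairs" amounts to, here is a minimal sketch that counts the pairings among the 22 languages listed above. It only does the arithmetic; it does not read the DGT-TM files themselves, and the variable names are my own.

```python
from itertools import combinations

# The 22 languages named in the post
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: (Czech, Danish) and (Danish, Czech) count once
pairs = list(combinations(LANGUAGES, 2))

# Each unordered pair can be used in two translation directions
directions = len(pairs) * 2

print(len(LANGUAGES), "languages")
print(len(pairs), "unordered language pairs")
print(directions, "translation directions")
```

So 22 languages give 231 unordered pairs, or 462 if you count each translation direction separately.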

Yes, you got that right: it’s all European, but it’s not *just* Indo-European (Estonian, Finnish, and Hungarian are Uralic, and Maltese is Semitic).

http://langtech.jrc.ec.europa.eu/DGT-TM.html

It’s triple the size that it was in 2007.

The corpus is made up of all European legislation: treaties, regulations, directives, etc. (If you join the EU you have to accept the whole Acquis Communautaire, so you have to get it translated into your official languages.)

Paper: Steinberger, Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter (2012).

Tags: Bulgarian, Czech, Danish, Dutch, english, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish, translation

Verb phrase ellipsis corpus

14 Apr

One of the hot topics in syntax dissertations in recent years has been ellipsis (I’m not kidding–when I was on a job search committee a few years ago, ellipsis was *everywhere*).

Bos and Spenader (2011) have an annotated corpus!

Bos, Johan and Jennifer Spenader. 2011. An annotated corpus for the analysis of VP ellipsis. Language Resources and Evaluation 45(4): 463–494.

Tags: ellipsis, english

Names

11 Apr

Stephanie Shih has been doing really fun work on what makes a name (first and last) using a corpus of Facebook names. This helps her get recent trends–the Social Security Administration releases first names all the time, but it doesn’t release first+last until 100 years after the birth certificates come in.

Stephanie’s talk at the LSA

Amaç Herdagdelen has compiled Census data from 1990 and put it with the Social Security Administration’s statistics for popular baby names for every year between 1960 and 2010:

Data, code: https://github.com/amacinho/Name-Gender-Guesser
Paper: http://clic.cimec.unitn.it/amac/twitter_ngram/Herdagdelen2012-RTC-draft.pdf
See also: Orwant and Daly’s older Perl module, which has fuzzy search capabilities (phonetic similarity of names): http://search.cpan.org/~edaly/Text-GenderFromName-0.32/GenderFromName.pm

Tags: America, census, english, gender, geography, names, phonology, phonotactics, social security

Persian verb inflections

10 Apr

Mohammad Sadegh Rasooli has put together a rule-based verb inflector for Persian as part of his preprocessing work on the Persian dependency treebank, among other projects. Find out more:

Project: http://dadegan.ir/en
Code: https://github.com/rasoolims/PersianVerbAnalyzer
Paper: http://dl.acm.org/citation.cfm?id=2178234

Tags: code, Farsi, Persian, verbs

World Englishes

9 Apr

Elizabeth Traugott offers these suggestions for corpora on world Englishes:

eWAVE = The electronic World Atlas of Varieties of English. 2011. Edited by Bernd Kortmann and Kerstin Lunkenheimer. Leipzig: Max Planck Institute for Evolutionary Anthropology. http://www.ewave-atlas.org/.
ICE = The International Corpus of English, version 2. 2006. Coordinated by Gerald Nelson (University of Hong Kong). http://ice-corpora.net/ice/index.htm.
ONZE = Origins of New Zealand English Corpus. In progress. Compiled by the ONZE project team. University of Canterbury. http://www.lacl.canterbury.ac.nz/onze/index.html.
SAVE = The South Asian Varieties of English Corpus. 2011. Compiled by Joybrato Mukherjee, Tobias Bernaisch, Christopher Koch, and Marco Schilk. University of Giessen. http://www.uni-giessen.de/cms/faculties/f05/engl/ling/research/save.

Tags: Aboriginal E, Acrolectal Fiji E, alternations, Appalachian E, Australian E, Australian Vernacular E, Bahamian C, Bahamian E, Bangladesh, Barbadian C (Bajan), Belizean C, Bislama, Black South African E, British Creole, Butler E, Cameroon E, Cameroon Pidgin, Canada, Channel Islands E, Chicano E, Colloquial American E, Colloquial Fiji E, Colloquial Singapore E, dative, dialects, ditransitive, Earlier African American Vernacular E, East Africa, East Anglia, Eastern Maroon C, english, Falkland Islands E, Ghanaian E, Ghanaian Pidgin, Great Britain, Gullah, Guyanese C, Hawaiian C, Hong Kong, Hong Kong E, India, Indian E, Indian South African E, Ireland, Irish E, Jamaica, Jamaican C, Jamaican E, Kenyan E, Krio, Liberian Settler E, Malaysian E, Maldives, Manx E, Nepal, New Zealand, New Zealand E, Newfoundland E, Nigerian E, Nigerian Pidgin, Norf'k, North of England, Orkney and Shetland E, Ozark E, Pakistan, Pakistan E, Palmerston E, Philippines, Roper River C (Kriol), Rural African American Vernacular E, San Andrés C, Saramaccan, Scottish E, SE of England, Singapore, South Asia, Southeast American Enclave dialects, Sranan, Sri Lanka, Sri Lanka E, St. Helena E, SW of England, Tanzanian E, Tok Pisin, Torres Strait C, Trinidadian C, Tristan da Cunha E, Ugandan E, Urban African American Vernacular E, USA, Vernacular Liberian E, Vincentian C, Welsh E, White South African E, White Zimbabwean E, world, [Maltese E]

Erorrs erorrs evrerywehere

7 Apr

Earlier I showed a neat resource that has grammatical and ungrammatical sentences from various linguistics papers over the years. But what if you want a whole bunch of English errors?

In that case, check out work that Robert Dale and Adam Kilgarriff are putting together: http://clt.mq.edu.au/research/projects/hoo/.

You might also look around for ESL corpora, for example:

CK Jung’s lab has a one-million-word written corpus of Korean learners of English (“YELC”): http://www.uclouvain.be/en-cecl-lcworld.html.
Izumi, Uchimoto, and Isahara (2004) have a Japanese Learner English Corpus, too (NICT JLE), but I think you have to email them.

Tags: english, errors, ESL, Japanese, Korean, ungrammatical

Build your own corpus (well, for now)

1 Apr

BootCaT is meant to help folks build up their own corpora from the Internet. However, it uses the Bing API and may not be able to do so for much longer, so it may go down temporarily. Go get your corpus started now!

http://bootcat.sslmit.unibo.it/

Note that Google also has an API that you can use (but they limit you to 100 queries a day), as do Blekko and EntireWeb.

You may also want to check out this O’Reilly summary of data sources.

Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]

African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]

COCA: What a fantastic source of data!

425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]

What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
