Arabic corpora

6 Feb

Before talking about specific Arabic resources, let me suggest some search engines that will be useful for Arabic (and many other languages, too):

If you want to check which LDC corpora we have that include Arabic, run a search here: http://www.ldc.upenn.edu/Catalog/catalogSearch.jsp.

Another good place to search is the LRE Map (“Language Resources and Evaluation”), which collects info from various NLP conferences. The interface is somehow both simple and confusing. To find Arabic resources, click “Lang” in the top row and then look over to the left-hand panel. Leave “Resource Name” blank and just use the “Resource Language” drop-down menu: http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml

A lot of the Arabic corpora you find are fairly formal since they come from news reports. I’m going to focus mostly on conversational stuff. First, stuff from the LDC:

Egyptian Arabic CALLHOME and CALLFRIEND. The CALLHOME corpus involves 120 half-hour conversations between native Egyptian Arabic speakers. Only 5-10 minutes of each conversation are transcribed (but you can find various parts of CALLHOME repurposed and transcribed in different ways, for example in the 1997 HUB5 and 2003 NIST collections from the LDC). Note that there is a “supplement” that gives another 20 conversations. The CALLFRIEND corpus involves 60 Egyptian Arabic conversations between people living in the US. I haven’t found as much transcription and processing of it, so probably lean towards CALLHOME.
Fisher Levantine Arabic. The Fisher method (the English version is a great resource, too, worth considering as a replacement for Switchboard stuff, btw) is to have strangers call and talk to each other. Here the data is mostly from folks around Jordan. Each pair of strangers talks about a specified topic for a while, so you get interesting topic and demographic information.
Gulf Arabic phone calls: 975 speakers engaged in spontaneous conversations lasting about six minutes each. There’s also a Levantine version and an Iraqi Arabic version. The Levantine and Gulf Arabic corpora are about the same size; the Iraqi Arabic one has about half the number of speakers.
901 phone calls, mostly between Arabic speakers from Lebanon.
OntoNotes 4.0 isn’t really conversational, but it is parsed and may be useful.

Now for some stuff outside of LDC:

There’s PropBank work that’ll get you Lebanese news annotated with verbal proposition and argument information.
Aralex has info useful to psycholinguistics-y stuff (frequency info, for example). Read about it here, or sign up here.
Motaz Saad makes a bunch of Arabic sources available here: https://sites.google.com/site/motazsite/Home/osac (don’t be intimidated by the Arabic; just scroll down for English). This tends to be “newsy” (CNN, etc.).
Mourad Abbas’s page, https://sites.google.com/site/mouradabbas9/corpora, also has newsy stuff, used for topic identification.
If you’re interested in Quranic corpora, Eric Atwell recommends checking out http://corpus.quran.com/ and/or http://quranytopics.appspot.com/.
And, btw, there is Arabic WordNet.


2 Responses to “Arabic corpora”

Robert February 6, 2012 at 5:51 pm #

Thanks for this blog and for this summary of Arabic resources.

In terms of sheer usefulness for Arabic teachers or students, I would highly recommend Dil Parkinson’s ArabiCorpus: arabicorpus.byu.edu. It contains 30+ million words, most of which are taken from Arabic media sources, but a few other texts are also included. The corpus is untagged, but it’s still great for garden-variety word-in-context searches. And with a little knowledge of regular expressions one can dig a little deeper.
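
For anyone wondering what a regular-expression word-in-context search on untagged text looks like, here is a minimal Python sketch. It does not talk to ArabiCorpus or reproduce its query syntax; the `kwic` helper, the `width` parameter, and the sample sentence are all made up for illustration, and the pattern just grabs any word that starts with the letters ذهب.

```python
import re


def kwic(text, pattern, width=20):
    """Return (left context, match, right context) for each regex match."""
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


# Hypothetical sample text standing in for a real corpus file.
sample = "ذهب الولد إلى المدرسة ثم ذهبت البنت إلى السوق"

# \b and \w are Unicode-aware in Python 3, so they work on Arabic letters.
hits = kwic(sample, r"\bذهب\w*", width=10)

for left, word, right in hits:
    print(f"{left!r} [{word}] {right!r}")
```

Because the text is untagged, a pattern like this matches on spelling alone, so it will also pick up unrelated words that happen to share the same letters.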


