Arabic corpora

6 Feb

Before talking about specific Arabic resources, let me suggest some search engines that will be useful for Arabic, and for many other languages too:

To see which of our LDC corpora include Arabic, run a search here: http://www.ldc.upenn.edu/Catalog/catalogSearch.jsp.

Another good place to search is LRE Map (“Language Resources and Evaluation”), which collects info from various NLP conferences. The interface is somehow both simple and confusing. To find Arabic resources, click “Lang” in the top row and then look over to the left-hand panel. Leave “Resource Name” blank and just use the “Resource Language” drop-down menu: http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml

A lot of the Arabic corpora you’ll find are fairly formal, since they come from news reports. I’m going to focus mostly on conversational stuff. First, stuff from the LDC:

  • Egyptian Arabic CALLHOME and CALLFRIEND. The CALLHOME corpus involves 120 half-hour conversations between native Egyptian Arabic speakers. For each conversation, 5-10 of those minutes are transcribed (but you can find various parts of CALLHOME repurposed and transcribed in different ways, for example in the 1997 HUB5 and 2003 NIST collections from the LDC). Note that there is a “supplement” that gives another 20 conversations. The CALLFRIEND corpus involves 60 Egyptian Arabic conversations between people living in the US. I haven’t found as much transcription and processing of it, so you should probably lean towards CALLHOME.
  • Fisher Levantine Arabic. The Fisher method (the English version is a great resource, too, worth considering as a replacement for Switchboard stuff, btw) is to have strangers call and talk to each other. Here the data is mostly from folks around Jordan. Each pair of strangers talks about a specified topic for a while, so you get interesting topic and demographic information.
  • Gulf Arabic phone calls: 975 speakers engaged in spontaneous conversations lasting about six minutes each. There’s also a Levantine version of this and a version for Iraqi Arabic. The Levantine and Gulf Arabic corpora are about the same size; the Iraqi Arabic one has about half the number of speakers.
  • 901 phone calls, mostly between Arabic speakers from Lebanon.
  • OntoNotes 4.0 isn’t really conversational, but it is parsed and may be useful.

Now for some stuff outside of LDC:


2 Responses to “Arabic corpora”

  1. Robert February 6, 2012 at 5:51 pm #

    Thanks for this blog and for this summary of Arabic resources.

    In terms of sheer usefulness for Arabic teachers or students, I would highly recommend Dil Parkinson’s ArabiCorpus: arabicorpus.byu.edu. It contains 30+ million words, most of which are taken from Arabic media sources, but a few other texts are also included. The corpus is untagged, but it’s still great for garden-variety word-in-context searches. And with a little knowledge of regular expressions, one can dig a little deeper.

