Before talking about specific Arabic resources, let me suggest some search engines that will be useful for Arabic, and for many other languages too:
If you want to check which of the LDC corpora we have include Arabic, run a search here: http://www.ldc.upenn.edu/Catalog/catalogSearch.jsp.
Another good place to search is LRE Map (“Language Resources and Evaluation”), which collects info from various NLP conferences. The interface is somehow both simple and confusing. To find Arabic resources, click “Lang” in the top row and then look over to the left-hand panel. Leave “Resource Name” blank and just use the “Resource Language” drop-down menu: http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml
A lot of the Arabic corpora you find are fairly formal since they come from news reports. I’m going to focus mostly on conversational stuff. First, stuff from the LDC:
- Egyptian Arabic CALLHOME and CALLFRIEND. The CALLHOME corpus involves 120 half-hour conversations between native Egyptian Arabic speakers. 5-10 minutes of each conversation are transcribed (but you can find various parts of CALLHOME repurposed and transcribed in different ways, for example in the 1997 HUB5 and 2003 NIST collections from the LDC). Note that there is a “supplement” that gives another 20 conversations. The CALLFRIEND corpus involves 60 Egyptian Arabic conversations between people living in the US. I haven’t found as much transcription and processing of it, so probably lean towards CALLHOME.
- Fisher Levantine Arabic. The Fisher method is to have strangers call and talk to each other. (The English version is a great resource, too, and worth considering as a replacement for Switchboard stuff, btw.) Here the data is mostly from folks around Jordan. Each pair of strangers talks about a specified topic for a while, so you get interesting topic and demographic information.
- Gulf Arabic phone calls: 975 speakers engaged in spontaneous conversations lasting about six minutes each. There’s also a Levantine version of this and a version for Iraqi Arabic. The Levantine and Gulf Arabic corpora are about the same size; the Iraqi Arabic one has about half the number of speakers.
- 901 phone calls, mostly between Arabic speakers from Lebanon.
- OntoNotes 4.0 isn’t really conversational, but it is parsed and may be useful.
Now for some stuff outside of LDC:
- There’s PropBank work that’ll get you Lebanese news annotated with verbal proposition and argument information.
- Aralex has info useful to psycholinguistics-y stuff (frequency info, for example). Read about it here, or sign up here.
- Motaz Saad makes a bunch of Arabic sources available here: https://sites.google.com/site/motazsite/Home/osac (don’t be intimidated by the Arabic; just scroll down for English). This tends to be “newsy” (CNN, etc.).
- Mourad Abbas gives us https://sites.google.com/site/mouradabbas9/corpora, which also has newsy stuff, used for topic identification.
- If you’re interested in Quranic corpora, Eric Atwell recommends checking out http://corpus.quran.com/ and/or http://quranytopics.appspot.com/.
- And, btw, there is Arabic WordNet.