African language corpora

8 Feb

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity, Africa is rivaled only by New Guinea (which, to be honest, probably exceeds it).

And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) squat about any African languages. My hope is this post will be useful for Africanists who are curious about big collections of spoken/written texts. But even more so, I’m hoping that researchers who stick to English and other languages with easy, familiar corpora will take an interest and start playing with the data. Fwiw, people who study African languages are really friendly–so if you’re, say, a computational linguist, it’ll be no problem to find you a buddy.

A number of the resources below should help you find out more about areas/languages outside of Africa, too.

Why study African languages?

Larry Hyman gives some answers here (my favorite part starts on page 14–Leggbo has subject-verb-object word order for affirmative sentences, but subject-object-verb word order in the negative). The Economist gives its answers here (of note: 600 million mobile phone users in Africa–more than Europe or America).

My own answers come from an initial shock of studying Zulu after getting lessons as an anniversary gift. The clicks are wonderful, it’s true. But have you seen the morphology? Zulu is a temptress. Okungumthakathi kuyangikhwifa (‘The damn witch is bewitching me.’)

My more recent work has taken me to Ethiopia–also a beautiful country and packed with very different, very understudied languages. In one village I get to work on three completely unrelated languages (Shabo, Majang, Shekkacho). It’ll be a while before there are any corpora of size on these languages (Shabo has 300 speakers), but Amharic is a major language with a sizable population. It’s Semitic, so if you squint your eyes, you can see a relationship to Arabic and Hebrew, but Amharic is more distant from both than they are from each other. Cushitic influenced it a lot, so you get subject-object-verb order, for example. You also get ejectives (make a plosive sound like t/k/p/d. There, you pushed air up from your lungs. Ejectives are like those, but they use only the air trapped above your closed glottis).

Finding African language corpora

First a request: if you know of corpora based on naturalistic spoken conversations, please let me know.

Kevin Scannell makes several great resources available for people studying less common languages:

It’s worth trucking over to African Language Technology and searching their archives/posting questions. Right off the bat, it has a number of resources, including:

And it’s also worth checking out all the different resources at OLAC.

For Zulu, check out the Ukwabelana Corpus, prepped by folks doing computational linguistics.

For a treasure-trove of Ndebele/Zulu lexical info–and a huge number of other Bantu languages–take a look at CBOLD (if you have issues, there’s an old version at Berkeley).

ALLEX offers resources for Ndebele, Shona, and Nambya.

Moving away from Bantu, for Amharic, there’s a whole site for resources to help you: (Speaking of Semitic languages, if you want Arabic, check out my earlier post here).

If you’re interested in Hausa, a good place to start is Uwe Seibert’s Hausa Online.

Also, don’t forget that outlets like the BBC have written/spoken content in Hausa, Somali, Swahili, and Kinyarwanda/Kirundi.

Personally, I’m fascinated by this resource on information structure for 25 sub-Saharan languages.

DoBeS has a number of languages archived, though these tend to be small, highly endangered languages, so if you only want giant amounts of text, this is probably the wrong place. On the other hand, if you want to have a huge impact… EMELD is another place people archive information.

From the LDC you might be interested in:

To make your own corpus from web texts, consider CorpusCollie (the example they use is Luo).
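
If you’d rather see the general idea before reaching for a tool, here is a minimal sketch of one way to filter web pages into a small corpus. To be clear, this is not how CorpusCollie works internally (I’m not describing its method here). It’s a toy illustration that assumes you already have the raw HTML of some pages and a hand-made set of common words in your target language. The function names (html_to_text, seed_ratio, build_corpus), the threshold of 0.2, and the seed tokens in the example are all made up for illustration.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def seed_ratio(text, seed_words):
    """Fraction of tokens in `text` that appear in `seed_words`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in seed_words for t in tokens) / len(tokens)


def build_corpus(pages, seed_words, threshold=0.2):
    """Keep each distinct page text whose seed-word ratio meets the threshold."""
    seen = set()
    corpus = []
    for html in pages:
        text = html_to_text(html)
        if text in seen:  # drop exact duplicates
            continue
        seen.add(text)
        if seed_ratio(text, seed_words) >= threshold:
            corpus.append(text)
    return corpus


if __name__ == "__main__":
    # Placeholder tokens only: these are NOT real words of any language.
    seeds = {"wa", "xe", "yo"}
    pages = [
        "<p>wa xe yo wa</p>",
        "<p>the quick brown fox</p>",
    ]
    for doc in build_corpus(pages, seeds):
        print(doc)
```

In practice you’d swap the placeholder seeds for a list of high-frequency words in the language you care about, and you’d want something smarter than exact-match deduplication. But the shape of the problem (fetch, strip, filter, dedupe) stays about the same.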

