There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity, Africa is rivaled only by New Guinea (which probably exceeds it, to be honest).
And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) squat about any African languages. My hope is this post will be useful for Africanists who are curious about big collections of spoken/written texts. But even more so, I’m hoping that researchers who stick to English and other languages with easy, familiar corpora will take an interest and start playing with the data. Fwiw, people who study African languages are really friendly–so if you’re, say, a computational linguist, it’ll be no problem to find you a buddy.
A number of the resources below should help you find out more about areas/languages outside of Africa, too.
Why study African languages?
Larry Hyman gives some answers here (my favorite part starts on page 14–Leggbo has subject-verb-object word order for affirmative sentences, but subject-object-verb word order in the negative). The Economist gives its answers here (of note: 600 million mobile phone users in Africa–more than Europe or America).
My own answers come from the initial shock of studying Zulu after getting lessons as an anniversary gift. The clicks are wonderful, it’s true. But have you seen the morphology? Zulu is a temptress. Okungumthakathi kuyangikhwifa (‘The damn witch is bewitching me.’)
My more recent work has taken me to Ethiopia–also a beautiful country, and packed with very different, very understudied languages. In one village I get to work on three completely unrelated languages (Shabo, Majang, Shekkacho). It’ll be a while before there are any corpora of size on these languages (Shabo has 300 speakers), but Amharic is a major language with a sizable speaker population. It’s Semitic, so if you squint, you can see a relationship to Arabic and Hebrew, but Amharic is more distant from each of them than they are from each other. Cushitic languages influenced it a lot, so you get subject-object-verb order, for example. You also get ejectives. (Make a plosive sound like t/k/p/d. There–you breathed through them. Ejectives are like those but just use the air stored above your glottis.)
Finding African language corpora
First a request: if you know of corpora based on naturalistic spoken conversations, please let me know.
- http://indigenoustweets.com/blogs/ and http://indigenoustweets.com/ show that blogs and Twitter may be an especially good source of data for Setswana, Hausa, Somali, Malagasy, Kinyarwanda, and Oromo.
- Kevin’s Crúbadán web crawler has also found a good amount of stuff on Malagasy, Somali, Kinyarwanda, Oromo, and Swahili.
It’s worth trucking over to African Language Technology and searching their archives/posting questions. Right off the bat, it has a number of resources, including:
- Swahili part-of-speech tagger, which is based on the Helsinki Swahili Corpus. As long as we’re on Swahili, there’s also the Sawa Corpus.
- Luganda parallel corpus
- Luo machine translation
- Northern Sotho part-of-speech tagger
ALLEX offers resources for Ndebele, Shona, and Nambya.
If you’re interested in Hausa, a good place to start is Uwe Seibert’s Hausa Online.
- Voice of America has Afan Oromo, Amharic, Hausa, Kinyarwanda/Kirundi, Ndebele, Shona, Somali, Swahili, and Tigrigna.
- Wikipedia has sizeable amounts of content in Amharic, Afrikaans, Malagasy, Swahili, and Yoruba.
Personally, I’m fascinated by this resource on information structure for 25 sub-Saharan languages.
DoBeS has a number of languages archived, though these tend to be small, highly endangered languages, so if you only want giant amounts of text, this is probably the wrong place. On the other hand, if you want to have a huge impact… EMELD is another place people archive information.
From the LDC you might be interested in:
- Mawukakan lexicon
- A vast Yoruba lexical database
- Ngomba tone paradigms
- Dschang lexicon and tone paradigms