Tag Archives: indigenous

Native American language resources

Yesterday I got a chance to hear Kayla Carpenter, Maryrose Barrios, and Justin Spence talk about preserving California Indian languages. (Kayla and Justin are grad students in the Berkeley linguistics department; Maryrose is an undergrad doing physics, including preservation work on really old audio records of native songs, stories, etc.)

If you’re a linguist, there’s all sorts of stuff to look at. If you’re a Native American, resources are getting easier and easier to get at. (There’s a lot of sensitivity to the idea that earlier work between researchers and community members ended up sending stuff into a black box, so current folks are trying to make both new and old materials more accessible for non-linguists.)

Thanks to Justin for sending me not just a list of resources but notes on them, too:

At a national level, you might want to check out the National Anthropological Archives at the Smithsonian and the American Philosophical Society. (The latter is where Sapir’s notes are; Sapir studied lots of languages around the turn of the last century and took really good notes.)

California has historically had the greatest density of native languages and folks at Berkeley have been archiving stuff for a long time. There are four main archives:

P.A. Hearst Museum of Anthropology (it’s got pre-1950 audio stuff).
Bancroft Library (paper stuff, pre-1950)
Post-1950, you can consult the Berkeley Language Center (audio) and the Survey of California and Other Indian Languages (paper stuff). They’ve recently combined their catalogs to make searching easier: http://cla.berkeley.edu. There are a lot of digital resources here (scanned images and digital audio).
- (It sounds like you can find records of what’s at the Hearst using CLA, too.)

Regional archives also have surprising stuff. Justin gives two examples:

Pliny Earle Goddard’s materials on Californian Athabaskan languages are mostly at Bancroft and the APS, but his Lassik notebooks are at the University of Washington (in the Melville Jacobs Papers collection; they are apparently marked up with Harry Hoijer’s annotations).
J.P. Harrington’s archives are mostly at the National Anthropological Archives, but the Barbareño Chumash materials are in the Santa Barbara Museum of Natural History.

Finally, Justin says the best summary of archival materials for languages of California is in Victor Golla’s recent book:

http://www.ucpress.edu/book.php?isbn=9780520266674

Tags: Achumawi, Athabaskan, Atsugewi, Awaswas, Barbareño, Cahuilla, California, Chalon, Chemehuevi, Chimariko, Chochenyo, Chumash, Cupeño, Esselen, Gabrielino, Hupa, indigenous, Ineseño, Juaneño, Karkin, Karuk, Kashaya, Kato, Kawaiisu, Kitanemuk, Konkow, Konomihu, Kumeyaay, Lassik, Luiseño, Maidu, Maricopa, Mattole, Miwok, Modoc, Mojave, Mono, Mutsun, native, Nisenan, Nomlaki, Obispeño, Okwanuchu, Paiute, Panamint, Patwin, Pomo, Purisimeño, Quechan, Ramaytush, Rumsen, Saclan, Salinan, Serrano, Shasta, Tamyen, Tataviam, Tolowa, Tubatulabal, Ventureño, Wappo, Washo, Wintu, Wiyot, Yana, Yokuts, Yukian, Yurok


African language corpora

8 Feb

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity, Africa is rivaled only by New Guinea (which, to be honest, probably exceeds it).

And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) squat about any African languages. My hope is this post will be useful for Africanists who are curious about big collections of spoken/written texts. But even more so, I’m hoping that researchers who stick to English and other languages with easy, familiar corpora will take an interest and start playing with the data. Fwiw, people who study African languages are really friendly–so if you’re, say, a computational linguist, it’ll be no problem to find you a buddy.

A number of the resources below should help you find out more about areas/languages outside of Africa, too.

Why study African languages?

Larry Hyman gives some answers here (my favorite part starts on page 14–Leggbo has subject-verb-object word order for affirmative sentences, but subject-object-verb word order in the negative). The Economist gives its answers here (of note: 600 million mobile phone users in Africa–more than Europe or America).

My own answers come from an initial shock of studying Zulu after getting lessons as an anniversary gift. The clicks are wonderful, it’s true. But have you seen the morphology? Zulu is a temptress. Okungumthakathi kuyangikhwifa (‘The damn witch is bewitching me.’)

My more recent work has taken me to Ethiopia, which is also a beautiful country and packed with very different, very understudied languages. In one village I get to work on three completely unrelated languages (Shabo, Majang, Shekkacho). It’ll be a while before there are any corpora of size on these languages (Shabo has 300 speakers), but Amharic is a major language with a sizable population. It’s Semitic, so if you squint your eyes, you can see a relationship to Arabic and Hebrew, but Amharic is more distant from each of them than they are from each other. Cushitic influenced it a lot, so you get subject-object-verb order, for example. You also get ejectives (make a plosive sound like t/k/p/d. There, you pushed air out from your lungs to make them. Ejectives are similar but use only the air trapped above your closed glottis).

Finding African language corpora

First a request: if you know of corpora based on naturalistic spoken conversations, please let me know.

Kevin Scannell makes several great resources available for people studying less common languages:

http://indigenoustweets.com/blogs/ and http://indigenoustweets.com/ show that blogs and Twitter may be an especially good source of data for Setswana, Hausa, Somali, Malagasy, Kinyarwanda, and Oromo.
Kevin’s Crúbadán web crawler has also found a good amount of stuff on Malagasy, Somali, Kinyarwanda, Oromo, and Swahili.

It’s worth trucking over to African Language Technology and searching their archives/posting questions. Right off the bat, it has a number of resources, including:

Swahili part-of-speech tagger, which is based on the Helsinki Swahili Corpus. As long as we’re on Swahili, there’s also the Sawa Corpus.
Luganda parallel corpus
Luo machine translation
Northern Sotho part-of-speech tagger

And it’s also worth checking out all the different resources at OLAC.

For Zulu, check out the Ukwabelana Corpus, prepped by folks doing computational linguistics.

For a treasure-trove of Ndebele/Zulu lexical info–and a huge number of other Bantu languages–take a look at CBOLD (if you have issues, there’s an old version at Berkeley).

ALLEX offers resources for Ndebele, Shona, and Nambya.

Moving away from Bantu, for Amharic, there’s a whole site for resources to help you: http://corpora.amharic.org/. (Speaking of Semitic languages, if you want Arabic, check out my earlier post here).

If you’re interested in Hausa, a good place to start is Uwe Seibert’s Hausa Online.

Also don’t forget that resources like the BBC have written/spoken content in Hausa, Somali, Swahili, and Kinyarwanda/Kirundi.

Voice of America has Afan Oromo, Amharic, Hausa, Kinyarwanda/Kirundi, Ndebele, Shona, Somali, Swahili, and Tigrigna.
Wikipedia has sizeable amounts of content in Amharic, Afrikaans, Malagasy, Swahili, and Yoruba.

Personally, I’m fascinated by this resource on information structure for 25 sub-Saharan languages.

DoBeS has a number of languages archived, though these tend to be small, highly endangered languages, so if you only want giant amounts of text, this is probably the wrong place. On the other hand, if you want to have a huge impact…EMELD is another place people archive information.

From the LDC you might be interested in:

Mawukakan lexicon
A vast Yoruba lexical database
Ngomba tone paradigms
Dschang lexicon and tone paradigms

To make your own corpus from web texts, consider CorpusCollie (the example they use is Luo).
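
If you want to see the basic idea before reaching for a tool, here is a minimal sketch in Python of the last step of corpus building: stripping the markup from pages you have already downloaded and counting word tokens. This is not how CorpusCollie works (I’m not describing its internals). The names `page_to_text`, `tokenize`, and `build_frequency_list` and the sample Swahili snippets are made up for illustration, and the tokenizer is a crude regular expression that would need adjusting for any real orthography.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_text(html):
    """Return the visible text of one HTML page with whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Assumption: a "word" is a run of letters, optionally joined by apostrophes
# (so Swahili ng'ombe stays one token). Digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_frequency_list(pages):
    """Count word tokens across an iterable of HTML strings."""
    counts = Counter()
    for html in pages:
        counts.update(tokenize(page_to_text(html)))
    return counts


if __name__ == "__main__":
    # Hypothetical sample pages; in practice these would come from a crawl.
    sample_pages = [
        "<html><body><p>Habari ya asubuhi</p><script>var x = 1;</script></body></html>",
        "<p>Habari za jioni</p>",
    ]
    for word, count in build_frequency_list(sample_pages).most_common(3):
        print(word, count)
```

A real pipeline would also need language identification and de-duplication before the counting step, since web pages in less common languages are often mixed with English boilerplate.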

Tags: Africa, Afrikaans, Amharic, and, Bantu, Dschang, endangered, Ethiopia, fav, Hausa, indigenous, Kinyarwanda, Kirundi, Luganda, Luo, Malagasy, Mawukakan, Nambya, native, Ndebele, Ngomba, Oromo, Setswana, Shona, Somali, Sotho, South Africa, Swahili, Tigrigna, Yoruba, Zulu


Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]
COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]
What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
