Tag Archives: indigenous

Native American language resources

8 Feb

Yesterday I got a chance to hear Kayla Carpenter, Maryrose Barrios, and Justin Spence talk about preserving California Indian languages. (Kayla and Justin are grad students in the Berkeley linguistics department; Maryrose is an undergrad doing physics, including preservation work on really old audio records of native songs, stories, etc.)

If you’re a linguist, there’s all sorts of stuff to look at. If you’re a Native American, resources are getting easier and easier to get at. (There’s a lot of sensitivity to the idea that earlier work between researchers and community members ended up sending stuff into a black box, so current folks are trying to make both new and old materials more accessible for non-linguists.)

Thanks to Justin for sending me not just a list of resources but notes on them, too:

At a national level, you might want to check out the National Anthropological Archives at the Smithsonian and the American Philosophical Society (the latter is where Sapir’s notes are; Sapir studied lots of languages around the turn of the last century and took really good notes).

California has historically had the greatest density of native languages, and folks at Berkeley have been archiving stuff for a long time. There are four main archives:

  • P.A. Hearst Museum of Anthropology (audio stuff, pre-1950)
  • Bancroft Library (paper stuff, pre-1950)
  • Post-1950, you can consult the Berkeley Language Center (audio) and the Survey of California and Other Indian Languages (paper stuff). They’ve recently combined their catalogs to make searching easier: http://cla.berkeley.edu. There are a lot of digital resources here (scanned images and digital audio).
    • (It sounds like you can find records of what’s at the Hearst using CLA, too.)

Regional archives also have surprising stuff. Justin gives two examples:

  • Pliny Earle Goddard’s materials on Californian Athabaskan languages are mostly at Bancroft and the APS, but his Lassik notebooks are at the University of Washington (in the Melville Jacobs Papers collection; they are apparently marked up with Harry Hoijer’s annotations).
  • J.P. Harrington’s archives are mostly at the National Anthropological Archives, but the Barbareño Chumash materials are in the Santa Barbara Museum of Natural History.

Finally, Justin says the best summary of archival materials for languages of California is in Victor Golla’s recent book:

 

African language corpora

8 Feb

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity, Africa is rivaled only by New Guinea (which probably exceeds it, to be honest).

And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) squat about any African languages. My hope is this post will be useful for Africanists who are curious about big collections of spoken/written texts. But even more so, I’m hoping that researchers who stick to English and other languages with easy, familiar corpora will take an interest and start playing with the data. Fwiw, people who study African languages are really friendly–so if you’re, say, a computational linguist, it’ll be no problem to find you a buddy.

A number of the resources below should help you find out more about areas/languages outside of Africa, too.

Why study African languages?

Larry Hyman gives some answers here (my favorite part starts on page 14–Leggbo has subject-verb-object word order for affirmative sentences, but subject-object-verb word order in the negative). The Economist gives its answers here (of note: 600 million mobile phone users in Africa–more than Europe or America).

My own answers come from an initial shock of studying Zulu after getting lessons as an anniversary gift. The clicks are wonderful, it’s true. But have you seen the morphology? Zulu is a temptress. Okungumthakathi kuyangikhwifa (‘The damn witch is bewitching me.’)

My more recent work has taken me to Ethiopia–also a beautiful country and packed with very different, very understudied languages. In one village I get to work on three completely unrelated languages (Shabo, Majang, Shekkacho). It’ll be a while before there are any corpora of size on these languages (Shabo has 300 speakers), but Amharic is a major language with a sizable population. It’s Semitic, so if you squint, you can see a relationship to Arabic and Hebrew, but Amharic is more distantly related to those two than they are to each other. Cushitic influenced it a lot, so you get subject-object-verb order, for example. You also get ejectives. (Make a plosive sound like t/k/p/d. There, you breathed through them. Ejectives are like those but use only the air stored above your glottis.)

Finding African language corpora

First a request: if you know of corpora based on naturalistic spoken conversations, please let me know.

Kevin Scannell makes several great resources available for people studying less common languages:

It’s worth trucking over to African Language Technology and searching their archives/posting questions. Right off the bat, it has a number of resources, including:

And it’s also worth checking out all the different resources at OLAC.

For Zulu, check out the Ukwabelana Corpus, prepped by folks doing computational linguistics.

For a treasure-trove of Ndebele/Zulu lexical info–and a huge number of other Bantu languages–take a look at CBOLD (if you have issues, there’s an old version at Berkeley).

ALLEX offers resources for Ndebele, Shona, and Nambya.

Moving away from Bantu, for Amharic, there’s a whole site for resources to help you: http://corpora.amharic.org/. (Speaking of Semitic languages, if you want Arabic, check out my earlier post here).

If you’re interested in Hausa, a good place to start is Uwe Seibert’s Hausa Online.

Also, don’t forget that outlets like the BBC have written/spoken content in Hausa, Somali, Swahili, and Kinyarwanda/Kirundi.

Personally, I’m fascinated by this resource on information structure for 25 sub-Saharan languages.

DoBeS has a number of languages archived, though these tend to be small, highly endangered languages, so if you only want giant amounts of text, this is probably the wrong place. On the other hand, if you want to have a huge impact… EMELD is another place people archive information.

From the LDC you might be interested in:

To make your own corpus from web texts, consider CorpusCollie (the example they use is Luo).