Mandarin | Corpus linguistics

Tag Archives: Mandarin

Prosodically annotated corpora

Here’s a summary of corpora to check out if you’re interested in prosody. It’s really English-heavy. Send me ideas for non-English sources that are annotated!

For ToBI marked stuff:

The Boston University Radio Speech Corpus will get you student hosts reading the news. The transcripts are marked up with prosodic information (ToBI) for about 3.5 hours’ worth of data. One nice thing is that it has inter-rater reliability information on the prosodic annotations (see Hasegawa-Johnson et al., 2005 for more about that and an example of research using the corpus).
There’s also ToBI annotation for 75 Switchboard conversations in the NXT edition: http://groups.inf.ed.ac.uk/switchboard/
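If inter-rater reliability is new to you, here’s a minimal sketch of one common agreement measure, Cohen’s kappa, computed over two annotators’ pitch accent labels. The labels are made up for illustration and the function name is hypothetical; nothing here claims to be the measure actually reported for the Boston University corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: what you'd expect from each annotator's label frequencies alone.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented ToBI-style pitch accent labels for six words (illustration only).
annotator_1 = ["H*", "L*", "H*", "none", "H*", "none"]
annotator_2 = ["H*", "L*", "none", "none", "H*", "H*"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw percent agreement for the agreement you’d get by chance, which matters when one label (like “no accent”) dominates.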

Other annotation systems:

You might check out the Santa Barbara Corpus, which is free now and is a great source for prosody research since it’s naturalistic and has a lot of different kinds of people talking in a lot of different situations. I’m not sure if anyone has ever annotated it with ToBI but the transcripts themselves have a host of prosodic cues.
The London-Lund Corpus has a lot of prosodic annotation, too.
The Hong Kong Corpus of Spoken English is naturalistic in that it’s all from real-life stuff (interviews, presentations, etc). You can get a flavor of it here but to get all the prosodic information, you need to get the book, here. It uses David Brazil’s Discourse Intonation system (prominence, tone, key, termination).
There’s also the Aix-MARSEC database, which is five hours of spoken British English with phonemes, syllables, syllable constituents, rhythm units, stress feet, words, and intonation units all marked up. (Get the data here, ready for Praat.)
The Wellington Corpus of Spoken New Zealand English has New Zealand English with emphatic stress marked.
The IViE corpus is labeled prosodically, too.

More of a stretch is the Audiovisual Database of Spoken American English. I don’t think most of you interested in prosody will care about this corpus, but I include it just in case.

Finally, in the universe of emotion and prosody, you can try out:

(See my previous posts on emotion here and here for other resources–note that the two above are both “acted”.)

Tags: english, Mandarin, prosody, stress, tobi


Ho ho ho, December’s new LDC corpora

6 Dec

December has brought us 18 DVDs’ worth of data.

Chinese Gigaword Fifth Edition (1 DVD)

Known to some of you as LDC2011T13, this is Mandarin Chinese newswire stuff. Here’s what the data looks like. If you’re working on Chinese, you probably want this.

2006 NIST Speaker Recognition Evaluation Training Set (7 DVDs)

“Honey, it’s your mother.” If you don’t recognize that voice, try developing a better algorithm on this training set: LDC2011S09. This is telephone speech mostly in English, but also in Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai, Urdu, and Yue Chinese. It’s 595 hours and there are English transcripts for the non-English parts.

You don’t have to be interested in only speaker identification to use this–ESL stuff, code-switching, and discourse studies would all make sense for the data here.

2006 NIST/USF Evaluation Resources for the VACE Program – Meeting Data Test Set Part 2 (10 DVDs)

In the catalog, this is called LDC2011V06 and I think you should probably follow that link. But basically you get 20 hours of meetings, held in research institutions in Pennsylvania, Virginia, Maryland, Scotland, Switzerland, and the Netherlands. (But all in English, I believe.)

For us linguists who normally work with just audio or text, this is a very rich video database. The VACE program’s goal was to extract video content automatically and to understand events. So there’s tracking of faces, hands, people, vehicles, and text. In other VACE corpora, you can get other meetings as well as broadcast news, street surveillance, and unmanned aerial vehicle motion imagery. Uh, okay, so if you’re a linguist looking at unmanned aerial vehicle motion imagery, you should send me a note to tell me more. But for the rest of us, this meeting data shows group dynamics that could go in any number of directions.

Tags: Arabic, Bengali, Chinese, computational, english, Hindi, Korean, ldc, Mandarin, newswire, Russian, Thai, Urdu, voice recognition, Yue


Emotion corpora

6 Nov

One of the common ways that phoneticians and other researchers have looked at emotion-in-language is by studying acted affect. That is, you get a bunch of people to read number lists or the alphabet in “angry” voice, “happy” voice, etc. Then you see if other people can reliably guess the emotion and then you go and look for the acoustic correlates.
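The “can other people reliably guess the emotion” step can be sketched as a per-emotion recognition rate. This is only a toy illustration: the judgment pairs below are invented, and `recognition_rates` is a hypothetical helper, not anything taken from a particular study.

```python
from collections import defaultdict


def recognition_rates(judgments):
    """Share of correct listener guesses for each intended emotion.

    `judgments` is a list of (intended_emotion, listener_guess) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for intended, guessed in judgments:
        totals[intended] += 1
        if guessed == intended:
            correct[intended] += 1
    return {emotion: correct[emotion] / totals[emotion] for emotion in totals}


# Invented listener judgments: (what the actor was asked to do, what the listener heard).
judgments = [
    ("angry", "angry"), ("angry", "angry"), ("angry", "happy"), ("angry", "angry"),
    ("happy", "happy"), ("happy", "angry"),
    ("sad", "sad"), ("sad", "sad"),
]
rates = recognition_rates(judgments)
print(rates)
```

Emotions that listeners identify well above chance are the ones where it makes sense to go hunting for acoustic correlates.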

If you’re interested in this sort of thing, you could try the Emotional Prosody Speech and Transcripts corpus (if you’re at Stanford and you’ve gotten corpus access, you’ll find it at /afs/ir/data/linguistic-data/EmotionalProsodySpeechAndTranscripts).

Now, acted data has a well-known problem: it is stereotyped in particular ways. If you wanted to detect what’s going on in a call center, “angry actors” wouldn’t help you nearly as much as “actual callers who are annoyed/disappointed/etc”. If you’re curious about more naturalistic corpora/research, here are some resources you might find useful (they’re all on my web page about emotions and language: http://www.stanford.edu/~tylers/emotions.shtml).

My talk at Nuance (the Dragon NaturallySpeaking and Siri folks): http://www.stanford.edu/~tylers/notes/papers/emotion/Nuance_emotion_detection_11-17-10_final.pptx. This is basically an intro for dealing with naturalistic emotional data for speech scientists and others interested in detection/recognition.
Notes on Clavel and Devillers (2011): http://www.stanford.edu/~tylers/notes/emotion/Comp_speech_special_issue_2011_reading_notes_Schnoebelen.pdf
Notes on Cowie and Cornelius (2003): http://www.stanford.edu/~tylers/notes/emotion/Cowie_Cornelius_2003_reading_notes_Schnoebelen.pdf
Maybe my notes on Amir and Cohen (2007) and a few others: http://www.stanford.edu/~tylers/notes/emotion/Various_detection_articles_reading_notes_Schnoebelen.pdf
You might poke around http://emotion-research.net/ for some more naturalistic corpora that are being used by people interested in emotion research. (And let me know what you find that’s useful.)

11/7/2011 post-script: If acted data suits your needs, you can also consider something other than English–for example, the Mandarin Affective Speech corpus will get you Chinese.

Tags: emotion, english, Mandarin, phonetics, phonology, prosody


Top five LDC corpora

30 Oct

In this post, I’d like to start off reviewing some of the most popular corpora that the Linguistic Data Consortium provides–with a few possibilities for alternatives. If you have a favorite corpus, send it in!

1. TIMIT Acoustic-Phonetic Continuous Speech Corpus

If you’re interested in speech recognition, here’s one of your main resources. It’s basically 630 people (8 American dialects) each reading 10 “phonetically rich sentences”. Plus these are time-aligned with transcripts (orthographic and phonetic). It’s been hand-verified and it’s pre-split into training/test subsets.

2. Web 1T 5-gram Version 1

This is basically Google n-gram stuff for English (unigrams to 5-grams). So if you want collocates and word frequencies, this is pretty good. There are 1 trillion word tokens, after all.

95 billion sentences
13 million unigrams
1 billion 5-grams

This data was released in 2006, though, so there should be more up-to-date resources.

There’s also a 2010 (Mandarin) Chinese 5-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

A 2009 Japanese 7-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T08

And a 2009 “European” 5-gram on Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, Swedish: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25
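If n-gram counts are new to you, here’s a minimal sketch of what these corpora contain: counts of every run of n adjacent tokens. The toy sentence and the `ngram_counts` helper are assumptions for illustration; this doesn’t read the actual Web 1T file format, which you’d want to check in the LDC documentation.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens in a token list."""
    # If the list is shorter than n, range() is empty and so is the Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# A made-up ten-token "corpus", split on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
fivegrams = ngram_counts(tokens, 5)

print(unigrams[("the",)])       # frequency of "the"
print(bigrams.most_common(1))   # the most frequent bigram
print(len(fivegrams))           # number of distinct 5-grams
```

Word frequencies fall out of the unigram counts, and collocates are just the high-count bigrams (or longer n-grams) containing your word of interest.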

3. CELEX2 (but why not try SUBTLEX?)

This corpus, circa 1996, gives you ASCII versions of three lexical databases for English, Dutch, and German. You get:

orthography variations
phonological stuff like syllables and stress
morphology
word class, argument structures
word frequency, lemma frequency (based on “recent and representative text corpora”).

In truth, if you just want word counts for American English, then consider using SUBTLEXus: http://subtlexus.lexique.org/. They make the case that CELEX is actually a poor source of frequency information (I’ll let you follow the link for their arguments against it and Kucera and Francis). Actually, if you go ahead and check out http://elexicon.wustl.edu/, you can download words (and non-words) with reaction times and all the morphology/phonology/syntax stuff that CELEX2 gives you.

4. TIDIGITS

Okay, I had never heard of this one. The main use for this corpus is speech recognition–for digits. You get 111 men, 114 women, 50 boys, and 51 girls each pronouncing 77 different sequences of digits in 1982.
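To get a sense of scale, here’s the arithmetic on those speaker counts. It assumes every speaker really did record all 77 sequences, which the description above implies but I haven’t verified against the corpus itself.

```python
# Speaker counts as given in the description above.
speakers = {"men": 111, "women": 114, "boys": 50, "girls": 51}
sequences_per_speaker = 77

total_speakers = sum(speakers.values())
# Assumption: no speaker skipped any of the 77 sequences.
total_utterances = total_speakers * sequences_per_speaker

print(total_speakers, total_utterances)
```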

5. ECI Multilingual Text

So the European Corpus Initiative Multilingual Corpus 1 (ECI/MCI) has 46 subcorpora totaling 92 million words (marked up but you can get the non-marked up stuff, too).

12 of the component corpora are parallel corpora, with translations in 2-9 languages.

Most of the stuff is journalistic, and there are some dictionaries, literature, and international organization publications/proceedings/reports. The stuff seems to come mostly from the 1980s and early 1990s.

Anyone have a favorite corpus of UN delegates talking and being translated into a bunch of different languages?

Languages available: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish

Tags: Albanian, Bulgarian, celex, Chinese, computational, Croatian, Czech, Danish, Dutch, english, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin, Mandarin Chinese, Modern Greek, morphology, ngram, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, parallel, phonetics, phonology, Portuguese, Russian, semantics, Serbian, Slovenian, Spanish, speech recognition, Standard Malay, subtlex, Swedish, syntax, translation, Turkish


Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx
Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]
African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]
COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]
What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
