Tag Archives: computational

Sentiment corpus

25 Jan

Found this and thought I’d pass it along to folks interested in sentiment/opinion/emotion research: http://www.cyberemotions.eu/data.html.

If you’re at an academic institution, they’ll give you access to a variety of things tagged with sentiments. Each tag is the average of ratings from three human coders. That’s not really a lot, though it is typical in the field right now. Each text has a 1-5 positive strength score and a separate 1-5 negative strength score.

  • BBC News forum posts: 2,594,745 comments from selected BBC News forums and > 1,000 human classified sentiment strengths.
  • Digg post comments: 1,646,153 comments on Digg posts (typically highlighting news or technology stories) and > 1,000 human classified sentiment strengths.
  • MySpace (social network site) comments: six sets of systematic samples (3 for the US and 3 for the UK) of all comments exchanged between pairs of friends (about 350 pairs for each UK sample and about 3,500 pairs for each US sample) from a total of >100,000 members and > 1,000 human classified sentiment strengths.

Ho ho ho, December’s new LDC corpora

6 Dec

December has brought us 18 DVDs worth of data.

Chinese Gigaword Fifth Edition (1 DVD)

Known to some of you as LDC2011T13, this is Mandarin Chinese newswire stuff. Here’s what the data looks like. If you’re working on Chinese, you probably want this.

2006 NIST Speaker Recognition Evaluation Training Set (7 DVDs)

“Honey, it’s your mother.” If you don’t recognize that voice, try developing a better algorithm on this training set: LDC2011S09. This is telephone speech mostly in English, but also in Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai, Urdu, and Yue Chinese. It’s 595 hours and there are English transcripts for the non-English parts.

You don’t have to be interested in only speaker identification to use this–ESL stuff, code-switching, and discourse studies would all make sense for the data here.

2006 NIST/USF Evaluation Resources for the VACE Program – Meeting Data Test Set Part 2 (10 DVDs)

In the catalog, this is called LDC2011V06 and I think you should probably follow that link. But basically you get 20 hours of meetings held in research institutions in Pennsylvania, Virginia, Maryland, Scotland, Switzerland, and the Netherlands. (But all in English, I believe.)

For those of us linguists who normally work with just audio or text, this is a very rich video database. The VACE program’s goal was to extract video content automatically and to understand events. So there’s tracking of faces, hands, people, vehicles, and text. In other VACE corpora, you can get other meetings as well as broadcast news, street surveillance, and unmanned aerial vehicle motion imagery. Uh, okay, so if you’re a linguist looking at unmanned aerial vehicle motion imagery, you should send me a note to tell me more. But for the rest of us, this meeting data shows group dynamics that could go in any number of directions.

Over 3k comments with sentiment coding

15 Nov
Just found this page (http://www.cyberemotions.eu/data.html) and thought I’d pass it along. If you go to the website, you can sign up to get access to their collection of:
  • BBC News forum posts: 2,594,745 comments from selected BBC News forums and > 1,000 human classified sentiment strengths with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
  • Digg post comments: 1,646,153 comments on Digg posts (typically highlighting news or technology stories) and > 1,000 human classified sentiment strengths with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
  • MySpace (social network site) comments: six sets of systematic samples (3 for the US and 3 for the UK) of all comments exchanged between pairs of friends (about 350 pairs for each UK sample and about 3,500 pairs for each US sample) from a total of >100,000 members and > 1,000 human classified sentiment strengths with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
Here are some examples of their classifications (although I think they just give you the average for each sentence):
  • hey witch wat cha been up too (scores: +ve: 2,3,1; -ve: 2,2,2)
  • omg my son has the same b-day as you lol (scores: +ve: 4,3,1; -ve: 1,1,1)
  • HEY U HAVE TWO FRIENDS!! (scores: +ve: 2,3,2; -ve: 1,1,1)
  • What’s up with that boy Carson? (scores: +ve: 1,1,1; -ve: 3,2,1)
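Since the released tag is the average of the three coders, here is a minimal Python sketch of that averaging, using the four example comments above. The function name `average_scores` and the dictionary layout are my own, not anything the CyberEmotions data ships with.

```python
from statistics import mean

# Coder scores copied from the four example comments above:
# (positive scores from three coders, negative scores from three coders)
examples = {
    "hey witch wat cha been up too": ((2, 3, 1), (2, 2, 2)),
    "omg my son has the same b-day as you lol": ((4, 3, 1), (1, 1, 1)),
    "HEY U HAVE TWO FRIENDS!!": ((2, 3, 2), (1, 1, 1)),
    "What's up with that boy Carson?": ((1, 1, 1), (3, 2, 1)),
}


def average_scores(pos, neg):
    """Return (mean positive strength, mean negative strength) across coders."""
    return mean(pos), mean(neg)


for text, (pos, neg) in examples.items():
    avg_pos, avg_neg = average_scores(pos, neg)
    print(f"{text!r}: +ve {avg_pos:.2f}, -ve {avg_neg:.2f}")
```

Note how the second example averages a 4, a 3, and a 1 on the positive scale: the mean hides quite a bit of disagreement.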
Here’s the annotator agreement table for the MySpace stuff.
Previous emotion-judgement/annotation tasks have obtained higher inter-coder scores, but without strength measures and therefore having fewer categories (e.g., Wiebe et al., 2005). Moreover, one previous paper noted that inter-coder agreement was higher on longer (blog) texts (Gill, Gergle, French, & Oberlander, 2008), suggesting that obtaining agreement on the short texts here would be difficult. The appropriate type of inter-coder reliability statistic for this kind of data with multiple coders and varying differences between categories is Krippendorff’s α (Artstein & Poesio, 2008; Krippendorff, 2004). Using the numerical difference in emotion score as weights, the three coder α values were 0.5743 for positive and 0.5634 for negative sentiment. These values are positive enough to indicate that there is broad agreement between the coders but not positive enough (e.g., < 0.67, although precise limits are not applicable to Krippendorff’s α with weights) to suggest that the coders are consistently measuring a clear underlying construct. Nevertheless, using the average of the coders as the gold standard still seems to be a reasonable method to get sentiment strength estimates.
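If you want a feel for how Krippendorff’s α behaves, here is a rough Python sketch for the simplest case: every text scored by every coder, with squared score differences as the disagreement weights (the "interval" metric). That weighting is my assumption; the passage above only says the numerical difference was used as weights, so don’t expect this to reproduce the 0.5743 and 0.5634 figures. The function name is mine too.

```python
from itertools import permutations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for complete data with squared-difference weights.

    ratings: list of units (texts), each a list with one score per coder.
    Assumes every coder scored every unit (no missing values).
    """
    n_coders = len(ratings[0])
    all_values = [value for unit in ratings for value in unit]
    n_total = len(all_values)

    # Observed disagreement: squared differences between coders within a unit.
    observed = sum(
        (a - b) ** 2
        for unit in ratings
        for a, b in permutations(unit, 2)
    ) / (n_total * (n_coders - 1))

    # Expected disagreement: squared differences between all pairs of scores,
    # regardless of which unit they came from.
    expected = sum(
        (a - b) ** 2 for a, b in permutations(all_values, 2)
    ) / (n_total * (n_total - 1))

    return 1.0 - observed / expected


# Positive-strength scores from the four example comments earlier in the post.
positive = [[2, 3, 1], [4, 3, 1], [2, 3, 2], [1, 1, 1]]
print(krippendorff_alpha_interval(positive))
```

Four comments is far too few to say anything, of course; the point is only that α compares disagreement within a text to the disagreement you would expect if scores were shuffled across texts.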


Table 1. Level of agreement between coders for the 1,041 evaluation comments (exact agreement, % of agreements within one class, mean percentage error, and Pearson correlation).
Columns (given separately for +ve and -ve): +/- 1 class, mean % diff., corr. Rows: Coder 1 vs. 2, Coder 1 vs. 3, Coder 2 vs. 3.
Now, I’m not a huge fan of objective/subjective distinctions (see, for example, my review of computational linguistics stuff on emotion). But positive/negative and intensity do seem to be real, if incomplete dimensions and this might be a useful set of data.

Topic modeling. Also sex.

11 Nov

Earlier today I mentioned some fun stuff available at Infochimps, including a corpus of 4,700+ erotic stories. I don’t know *who* is clicking on that link, but in the spirit of “give your readers what they want”, I’m going to give you a sense of that corpus (you can skip down below–but WAIT!).

One of my favorite tools out there is the Topic Modeling Toolkit that the Stanford NLP Group has made available (Daniel Ramage and Evan Rosen, in particular).

Let’s say you have a set of texts (your corpus). This could be a set of Shakespearean plays or it could be a bunch of tweets or it could be answers to linguistic ideology surveys (“What does this person sound like?”).

You probably want to understand how these texts are similar and different from one another. If lexical clues are interesting to you (because they are host to particular sounds or meanings or relate to syntax, etc), then you might want to try topic modeling.

You really don’t have to be a programmer to use the Topic Modeling Toolkit (TMT)–the instructions are very clear: http://nlp.stanford.edu/software/tmt/tmt-0.3/. Unlike most forms of clustering, topic modeling lets you have words appear in more than one place and it’s more scientific and systematic than word clouds you might be tempted to use.

I’ve been using the TMT for a number of different projects. I’m not going to talk about them. Instead, a brand-new *exclusive to this blog* mini-project on erotica. I should also note that in my “real” work, I find “labeled LDA” most useful. What I show below is plain LDA, which looks at statistics to see what clusters together rather than looking at what clusters relative to some meta-data tag you might have. “Labeled LDA” is easy to do, too (the last section on this page).

Step 1: Get your corpus

In my case, I grabbed the following file that I found on Infochimps: http://www.infochimps.com/datasets/corpus-of-erotica-stories

Step 2: Clean it up

The nice thing about corpora you get from the LDC is that they’re pretty tidy. Not so with web-based stuff: the corpus in question had extra tabs all over the place. Here are some simple UNIX commands to clean it up. (If you have a Windows machine, you might try getting Cygwin so you can do UNIX-y things on your PC.)

sed "s/\t/ /g" erotica_text.tsv > erotica_text_cleaned.tsv

Basically what this little code is doing is substituting (s): it finds tabs (/\t/) and replaces them with spaces (/ /) everywhere (g) in a particular file (erotica_text.tsv). Then you say that you want the output (>) to go to a new file (erotica_text_cleaned.tsv). Here’s more on sed and tabs.
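If sed isn’t handy, the same cleanup can be sketched in a few lines of Python. The function name `replace_tabs` is mine, and reading the file as UTF-8 is an assumption about how the corpus is encoded.

```python
def replace_tabs(in_path, out_path):
    """Copy in_path to out_path with every tab replaced by a single space."""
    # Assumption: the corpus file is UTF-8 text.
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.replace("\t", " "))


# Usage, with the file names from the sed command above:
# replace_tabs("erotica_text.tsv", "erotica_text_cleaned.tsv")
```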

To make the TMT work out-of-the-box, I went ahead and added a column with a unique identifier to each line (each line=a different text/story). I did that in Excel because I think Excel is pretty handy.
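If you’d rather not open a big file in Excel, here’s a rough Python sketch that does the same thing: it prepends a unique ID to every line. The `story00001`-style IDs, the tab separator between ID and text, and the choice to skip blank lines are all my assumptions, so adjust them to whatever your TMT script expects.

```python
def add_id_column(in_path, out_path, prefix="story"):
    """Write each non-blank line of in_path to out_path with an ID column first.

    Returns the number of lines (stories) written.
    """
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines so every ID maps to a real story
            count += 1
            # ID first, then a tab, then the original line (text column).
            dst.write(f"{prefix}{count:05d}\t{line}")
    return count


# Usage (the output file name is hypothetical):
# add_id_column("erotica_text_cleaned.tsv", "erotica_text_with_ids.tsv")
```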

Step 3: Run the TMT

Alright, strictly speaking, you need to install the TMT, copy the scala scripts and *slightly* edit them–that’s all covered here very well, so I’m not going to repeat it.

In my case, I ran an edited version of the example-1 script that Dan and Evan provide. This is enough data that my laptop didn’t really want to handle it, so I used the NLP machines. The key is editing the script to point to the ID column (1) and the text column (2 for me). I ran an edited version of their example-5 script so that I could figure out the best number of topics to have–the idea is that the more topics you add, the lower “perplexity score” you get, but at some point, you don’t gain that much from adding more topics. My drop-off point was at 15 topics. Next, I ran a version of their example-2 script (switching from the default 30 topics to 15).
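To make the "drop-off point" idea concrete, here is a small Python sketch that picks the topic count after which adding more topics stops lowering perplexity by much. The perplexity numbers are made up for illustration (they are not my actual results), and the 2% threshold and the function name are arbitrary choices of mine, not something the TMT gives you.

```python
def pick_topic_count(perplexity_by_k, min_gain=0.02):
    """Return the smallest topic count after which extra topics help little.

    perplexity_by_k: dict mapping number of topics -> perplexity (lower is better).
    min_gain: smallest relative drop in perplexity still counted as worthwhile.
    """
    ks = sorted(perplexity_by_k)
    for current, nxt in zip(ks, ks[1:]):
        drop = perplexity_by_k[current] - perplexity_by_k[nxt]
        if drop / perplexity_by_k[current] < min_gain:
            return current
    return ks[-1]  # perplexity kept improving; take the largest count tried


# Made-up perplexity scores, for illustration only.
scores = {5: 2400.0, 10: 2100.0, 15: 1950.0, 20: 1930.0, 25: 1920.0, 30: 1915.0}
print(pick_topic_count(scores))  # prints 15
```

In practice I’d still eyeball the curve rather than trust a threshold, but this is the logic.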

Through all of this I dropped the 40 most frequent terms–that’s standard practice in computational linguistics, though you want to think about it carefully. For example, in my research on Twitter emoticons if I had done this, I would’ve dropped “lol” and “happy”, and for a study about what emoticons co-occur with that would’ve been silly. Since the present study is just exploration, I really just want to knock off frequent words so they don’t dominate and repeat across all the topics.
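The TMT handles this step inside its own scripts, but the idea is simple enough to sketch in Python if you want to see which words you’d be throwing away before you commit. The function name is mine, and it assumes your texts are already split into tokens.

```python
from collections import Counter


def drop_most_frequent(docs, n=40):
    """Remove the n most frequent terms from a corpus.

    docs: list of documents, each a list of tokens.
    Returns (filtered documents, set of dropped terms).
    """
    counts = Counter(token for doc in docs for token in doc)
    dropped = {term for term, _ in counts.most_common(n)}
    filtered = [[token for token in doc if token not in dropped] for doc in docs]
    return filtered, dropped


# Tiny example: drop only the single most frequent term.
docs = [["the", "cat", "the"], ["the", "dog", "cat"]]
filtered, dropped = drop_most_frequent(docs, n=1)
print(dropped)   # {'the'}
print(filtered)  # [['cat'], ['dog', 'cat']]
```

Looking at `dropped` before running the model is a cheap way to catch the “lol”/“happy” kind of problem.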

Step 4: Analyze

The TMT runs a bunch of iterations (by default, 1000). I grab the summary.txt file from the “01000” folder. What shows up?
  • Topic 00: S&M
    • will, mistress, slave, leather, more, tied, master, behind, around, feet, ass, again, pain, head, are, left, room, gag, very, pulled
    • So a lot of these are social roles and tools of the trade. There’s also spatial stuff (the “where” is important–room, behind, I’m not sure about left since I didn’t part-of-speech tag the corpus).
    • Will is almost certainly mostly about the future tense, though it is kind of fun that in a bondage scene you see it so much (I will break your will…). Notice also the use of more, very, and again: keywords that may be worth pursuing.
  • Topic 01: Swingers
    • john, linda, susan, bill, wife, debbie, our, their, tom, paul, jeff, very, told, other, fuck, after, two, jack, janet, kathy
    • The names are all well and good (someone want to grab this for the American Name Society?). I’m guessing–without really looking at the data–that there’s some sort of marriage thing happening here with wife as well as our/their; perhaps two, after, and other fit this as well. I’m curious about why told is there so much.
  • Topic 02: Body parts
    • I’m pretty sure that if I list all the words in this topic, the whole blog will be shoved into some restricted, spammy part of the Internet. So let me know if you want the words–I’ll send them to you. In addition to 13 body parts and a word that rhymes with sum, there’s began, started, pulled, off, around, lisa, hard.
  • Topic 03: G-rated body parts
    • face, head, body, around, man, off, eyes, against, again, little, legs, hands, away, there, feet, pain, hair, began, arms, through
    • Prepositions are pretty interesting when it comes to sex, aren’t they. I’m also curious about little. And all of this beginning/starting stuff.
  • Topic 04: Classroom hijinks
    • bottom, jennifer, been, mrs, school, girl, miss, girls, spanking, desk, sharon, again, more, very, before, class, still, two, after, will
    • The word I’m most interested in here is still.
    • Notice, btw, that spanking is not part of the S&M-y Topic 00 above.
  • Topic 05: ???
    • their, been, who, are, will, very, there, some, which, more, our, other, only, sex, any, first, than, after, most, even
    • Not sure what to make of this. Thoughts?
  • Topic 06: Family stuff
    • None of the words in this topic are “naughty”, however, once I tell you that they are occurring as a topic in an analysis of erotic stories, they are likely to induce a “whoa” reaction. Send me an email if you want the list.
  • Topic 07: Describing the acts
    • Here the words are body parts, actions, and evaluations (hot, hard, big).
  • Topic 08: Other names
    • mary, jim, sue, dave, jane, carol, beth, ann, sandy, cathy, donna, brad…also various names for female body parts, as well as hot, began, their, while, pulled
  • Topic 09: Star Trek!
    • been, alex, captain, their, janice, there, are, peter, beverly, looked, before, more, did, will, himself, well, eyes, only, deanna, off
    • Where eyes are appearing and what they are doing is probably worth pursuing.
    • Judging from the titles of the stories in the corpus, there are a number that put Commander Will Riker and/or Captain Jean-Luc Picard with Dr. Beverly Crusher and I recall one title putting Counselor Deanna Troi with Wesley Crusher (played by Wil Wheaton).
    • I wonder about only.
  • Topic 10: Romance novel type descriptions of acts
    • Body parts, prepositions, and adverbs that are fairly tame (e.g., slowly).
  • Topic 11: Exposition
    • there, got, started, off, get, didn’t, went, after, some, around, told, did, came, really, looked, more, took, our, see, before
    • I think these may be about setting the scene before the action takes place?
  • Topic 12: Women and what they wear
    • Various clothing items and verbs that go with them (looked, look, wearing, took).
  • Topic 13: Are you exhausted yet? Some mini-project this is…
    • This is a sexual topic, but nothing too extreme.
  • Topic 14: ???
    • don’t, i’m, want, know, get, think, can, didn’t, how, it’s, are, right, going, well, really, you’re, good, there, yes, did
    • As with topic 05, I’m not quite sure what to do with these, though I think these are a little more coherent as a category–they seem very discoursey to me.
The main point here is that you should consider topic modeling as a way of exploring your data. I wonder if I succeeded in that.
In terms of the erotic topics, maybe I can claim that we’re showing how desires are constructed–what goes with what. I think it also gives you a sense of “keywords” worth further pursuit. For example, this is probably not the right corpus to use to inquire about the nature of “liberty” or dative alternations. But what about (i) Prepositions of Desire, (ii) reported and reporting speech, (iii) discourse markers, and (iv) names-and-sex (see also Arnold Zwicky’s “How to name a porn star” and Amy Perfors’ HotOrNot experiment with front/back vowels)?

Monkeying around: Infochimps

11 Nov
It’s Friday, so it’s a great day to go explore Infochimps. Infochimps is kind of like a data clearinghouse. It has all sorts of stuff. For example:

Some of you eat with finger bowls and extended pinkies–does Infochimps have less smutty stuff?

Finally, some stuff on accents:

  • Infochimps HAS lots of data and they also LINK to lots of data. For example, the Speech Accent Archive, which has over 1,200 people reading a paragraph in English. They’re from all over the place (native and non-native English speakers).
  • There’s also a meme on YouTube that you could use. Search for “What’s my accent?” (Here are some steps I wrote up about how to download YouTube audio so you can play with it in Praat).
    • And, okay, the whole reason you should’ve been reading this post is to get the following link to “Pronunciation Manual“. If you haven’t seen it yet: it’s honey-badger good. Happy Friday!

October corpora from LDC

3 Nov

(This is mostly for Stanford folks)

We get periodic shipments of new corpora from the LDC. These are always available for you to check out as DVDs (just follow steps for access here). We can also put these online so you can ssh into the Stanford servers and go to /afs/ir/data/linguistic-data.

But there’s a catch. We have a limited amount of space there–so to add something, we need to remove something. If any of these corpora–or any other corpora you know about–would be great to have online, send me a note.

Spanish Gigaword–third edition

The great thing about this corpus is that it is enormous. Depending upon your research project, you may or may not be as psyched about it being newswire text. It’s got everything the previous editions had, plus newer stuff–so it covers, roughly, the mid-1990’s til Dec 31, 2010.

Arabic Gigaword–fifth edition

Same basic deal as the Spanish Gigaword–it covers news in Arabic from 2000 until Dec 2010. Here’s what the marked-up content looks like: http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2011T11.jpg.

2008 NIST Speaker Recognition Evaluation Test Set

This is actually nine DVDs worth of data because it’s 942 hours of telephone speech and interviews. The telephone speech is multilingual–predominately English but bilinguals were recruited, so in the telephone conversations you also get Arabic, Bengali, Chinese, Egyptian Arabic, Farsi, Hindi, Italian, Japanese, Korean, Lao, Punjabi, Russian, Tagalog, Tamil, Thai, Urdu, Uzbek, Vietnamese, Wu Chinese, Yue Chinese. The interviews are just English.

You do get the transcripts, btw. The corpus was imagined to be for speech recognition, but there may be some really interesting code-switching stuff for people interested in bilingual data.


Top five LDC corpora

30 Oct

In this post, I’d like to start off reviewing some of the most popular corpora that the Linguistic Data Consortium provides–with a few possibilities for alternatives. If you have a favorite corpus, send it in!

1. TIMIT Acoustic-Phonetic Continuous Speech Corpus

If you’re interested in speech recognition, here’s one of your main resources. It’s basically 630 people (8 American dialects) reading 10 “phonetically rich sentences”. Plus these are time-aligned with transcripts (orthographic and phonetic). It’s been hand-verified and it’s pre-split into training/test subsets.

2. Web 1T 5-gram Version 1

This is basically Google n-gram stuff for English (unigrams to 5-grams). So if you want collocates and word frequencies, this is pretty good. There are 1 trillion word tokens, after all.

  • 95 billion sentences
  • 13 million unigrams
  • 1 billion 5-grams
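If “unigrams to 5-grams” is new to you, here’s a tiny Python sketch of what gets counted: every run of 1 to 5 consecutive words. This is only an illustration of the idea on a toy sentence, not how the Web 1T files are built or formatted, and the function name is mine.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)
print(counts[("the",)])        # 2
print(counts[("the", "cat")])  # 1
```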

This data was released in 2006, though, so there should be more up-to-date resources.

There’s also a 2010 (Mandarin) Chinese 5-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

A 2009 Japanese 7-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T08

And a 2009 “European” 5-gram on Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, Swedish: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25

3. CELEX2 (but why not try SUBTLEX?)

This corpus, circa 1996, gives you ASCII versions of three lexical databases for English, Dutch, and German. You get:

  • orthography variations
  • phonological stuff like syllables and stress
  • morphology
  • word class, argument structures
  • word frequency, lemma frequency (based on “recent and representative” text corpora).

In truth, if you just want word counts for American English then consider using SUBTLEXus: http://subtlexus.lexique.org/. They make the case that CELEX is actually bad to rely on for frequency information (I’ll let you follow the link for their arguments against it and Kucera and Francis). Actually, if you go ahead and check out http://elexicon.wustl.edu/, you can download words (and non-words) with reaction times and all the morphology/phonology/syntax stuff that CELEX2 gives you.


Okay, I had never heard of this one. The main use for this corpus is speech recognition–for digits. You get 111 men, 114 women, 50 boys, and 51 girls each pronouncing 77 different sequences of digits in 1982.

5. ECI Multilingual Text

So the European Corpus Initiative Multilingual Corpus 1 (ECI/MCI) has 46 subcorpora totaling 92 million words (marked up but you can get the non-marked up stuff, too).

12 of the component corpora have parallel translated corpora from 2-9 other corpora.

Most of the stuff is journalistic, and there are some dictionaries, literature, and international organization publications/proceedings/reports. The stuff seems to come mostly from the 1980’s and early 1990’s.

Anyone have a favorite corpus of UN delegates talking and being translated into a bunch of different languages?

Languages available: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish