Archive | November, 2011

What were the cultural keywords when you were born?

22 Nov

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more “scientific”.

David Beaven (University of Glasgow) has put together a really neat page of “keywords” from the Google Books ngram corpus. Go find out what your birth year was like:

http://www.scottishcorpus.ac.uk/corpus/diaview/

(There are two options–year by year or five-year clumps.)

Basically, he’s figuring out what words are occurring in, say, a five year time block much more often than they appear elsewhere. So 1940-1944 is characterized by WWII:

1940‑1944: nazi hitler aircraft planes grid tanks radio germans vitamin propaganda aluminum voltage automobile orchestra pilot gear altitude civilian coil polish roosevelt poland rubber outstanding vocational

Books in the first half of the 60’s talked a lot about the Cold War:

1960‑1964: communist communists eq shri communism ussr electrons ml atomic soviet electron radiation equations scattering amino amplitude ions nuclear frequencies rs equation pakistan coefficients approximation momentum

The most recent group shows business and Internet and a continued interest in gender:

2000‑2004: web global gender options software uk user strategies phone networks users kids gene eds humans focused implementation com option files network typically asian environmental button

You can click on the links to see the 25 years around the “peaks”, but you should really go do it from David’s site since it gives more options (and since this post is just a tiny sampling of what his site offers).
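By the way, the core computation is simple enough to play with yourself if you grab per-year word counts (say, from the Google Books ngram downloads). Here’s a minimal sketch in R; the input file and its columns are hypothetical, and this is just a crude frequency-ratio version of the idea, not David’s actual method:

# Minimal sketch, not David's actual method: score each word by how much more
# frequent it is in a target block (here 1940-1944) than in the rest of the data.
# The file name and its columns (word, year, count) are hypothetical.
counts <- read.delim("ngram_counts_by_year.tsv", stringsAsFactors = FALSE)

block <- subset(counts, year >= 1940 & year <= 1944)
rest  <- subset(counts, year < 1940 | year > 1944)

block.freq <- tapply(block$count, block$word, sum) / sum(block$count)
rest.freq  <- tapply(rest$count, rest$word, sum) / sum(rest$count)

shared  <- intersect(names(block.freq), names(rest.freq))
keyness <- block.freq[shared] / rest.freq[shared]

# The 25 words most characteristic of the block
head(sort(keyness, decreasing = TRUE), 25)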


Vocabulary richness

21 Nov

This post comes out of the Corpora mailing list (join or search it here; really, it’s full of stuff).

One recent discussion is about “TTR”, which is an old-school way of measuring the lexical diversity of a text. The abbreviation stands for “type-token ratio”: you count how many unique word types a text has and divide that by its total number of tokens.

That’s pretty easy to calculate, but as people on the list point out, what the hell are you going to use it for? Let’s say you want to compare some novels or you want to compare some transcribed speech from kids you’re worried about. The TTR is going to be really dependent on how much data you have. So if you want any sort of stats, you need equal-size text samples. (So you’d sample each text down to the number of tokens in your smallest text.)
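If you want to see that in code, here’s a minimal sketch in R; the file names and the whitespace tokenization are stand-ins for whatever you’re actually working with:

# Minimal sketch: type-token ratio computed on equal-size samples.
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

text1 <- scan("novel1.txt", what = "character")   # hypothetical files;
text2 <- scan("novel2.txt", what = "character")   # scan() splits on whitespace

n <- min(length(text1), length(text2))   # size of the smallest text
set.seed(1)
ttr(sample(text1, n))   # compare TTRs on samples of the same size
ttr(sample(text2, n))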

As the thread points out, you probably want to check out Tweedie and Baayen (1998) on “How variable may a constant be? Measures of lexical richness in perspective”.

But in terms of actual implementation of TTR and alternative measures, I would steer you to chapter 6.5 of Baayen’s introduction to linguistic analysis in R.

(Also see Benjamin Allison’s post for some thoughts about how to measure vocabulary richness. And David Hoover’s 2003 work on vocabulary richness measures.)

Who is the Sarah Palin of the Canterbury Tales?

17 Nov

People have different styles of communicating–we interpret these styles situationally (“he’s upset/flirting/pretending to be objective”), broadly (“he’s rich/poor/straight/gay/born in Detroit”), and/or as identity markers (“it’s cuz he’s a total bro“).

It occurred to me that a great place to look at style is in the Canterbury Tales because Chaucer is doing so much styling of his characters–making not just their stories but their poetics different from one another. This post gives me a chance to link you to some Middle English corpora, talk a bit about part-of-speech (POS) tagging, and to tell you which pilgrim in the Canterbury Tales is the most like Sarah Palin (okay, yes, in one particular dimension).

Some background

A few weeks ago, Eric Acton gave a presentation at NWAV about work he and Chris Potts had done on affective demonstratives and Sarah Palin. For example:

And Secretary Rice, having recently met with leaders on one side or the other there, also, still in these waning days of the Bush administration, trying to forge that peace, and that needs to be done, and that will be top of an agenda item, also, under a McCain-Palin administration.

If you look at Palin’s speech, she really has this/that/these/those all over the place. Everyone can and does use demonstratives to position not just physical objects (“hand me ‘that’ cup over there”, “‘this’ is my son”) but to take up stances towards things that draw them closer or push them farther away from the speaker and/or their audience. But some people, like Sarah Palin, use this device A LOT. We could say that it’s part of her style–part of what makes her, her.

There’s lots of interesting stuff to be said about these affective demonstratives and I recommend that you check out:

  • Lakoff, Robin. 1974. Remarks on this and that. In CLS 10, 345-356. Chicago: Chicago Linguistic Society. (Okay the first thing I list I haven’t actually been able to find…Chicago and/or Berkeley linguists, help!)
  • Mark Liberman’s various posts on the Language Log: here and here (and maybe here; Chris Potts also has a follow-up on the LL here).
  • Davis, Christopher and Christopher Potts. 2010. Affective demonstratives and the division of pragmatic labor. In Maria Aloni, Harald Bastiaanse, Tikitu de Jager, and Katrin Schulz, eds., Logic, Language, and Meaning: 17th Amsterdam Colloquium Revised Selected Papers, 42-52. Berlin: Springer.
  • Potts, Christopher and Florian Schwarz. 2010. Affective ‘this’. Linguistic Issues in Language Technology 3(5):1-30.
  • And we should hound Chris and Eric to post their NWAV talk, too.
  • If you want to know what sociolinguists are doing with style, btw, I have some resources on my website.

Corpora for Middle English

I have a secret plan to write up something about Old English corpora and then about Shakespeare and beyond, but for now, let’s stick to the late 10th century to the early 15th century corpora.

Part of speech tagging

If we want to investigate “this”, we’re in pretty good shape just using good old grepping (or Ctrl-F).

But “that” is tricky, because our hypothesis doesn’t really think that the complementizer “that” is doing affective stuff (complementizer = “I pity the fool that falls in love with you”).

For other questions like this, you can use any of a number of POS taggers that are out there, free and fairly easy to use. My tagger of choice is the Stanford POS Tagger. A few quick notes about it:

  • The highest accuracy comes with using “bidirectional-distsim-wsj-0-18.tagger” but it’s really really slow. Don’t use it. The “left3words-wsj-0-18.tagger” is almost as accurate and A LOT faster. I’m not kidding about this.
  • Once you’ve installed it, the basic command to get it going is as follows. Note that the “-mx300m” is about memory, so you may need to up the “300” part.

{call java at the start–the full path to your java if you don’t have it saved to your path} -mx300m -classpath {directory stuff}/postagger/pos_tagger_dir/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model {directory stuff}/postagger/pos_tagger_dir/models/left3words-wsj-0-18.tagger -textFile {input file to tag} > {output file name}

But of course tagging Middle English poetry using a tagger that was trained on 20th century articles from the Wall Street Journal is not such a great idea. Actually, I was surprised at how well it did, but nevertheless: uff.

So yaaaaay, WordHoard to the rescue.

WordHoard corpora + tool

Now, you’ll have to download WordHoard and figure out  how to use their interface, but I found it pretty easy to do stuff like find collocates, multiword expressions, compare texts (Wife of Bath vs. Knight; Hamlet vs. Twelfth Night, etc.). It also has lexicons for the various corpora it supports.

Those corpora are not just English–you also have Ancient Greek. From the website:

  • Early Greek Epic. This corpus includes Homer, Hesiod, and the Homeric Hymns in the original Greek, with English and/or German translations for all texts but Shield of Herakles.
  • Chaucer. We have all the works of Chaucer, including all of The Canterbury Tales.
  • Spenser. We have all of the poetical works of Spenser, including The Faerie Queene.
  • Shakespeare. We have all the works of Shakespeare, including all of his plays and poems.

These are POS tagged, which is going to let us distinguish between complementizer-that and demonstrative-that (for example).

Collocations (first steps in analysis)

One of the most fun things to do with corpora is to look at collocations–what words are showing up with what other words more than they should be by chance? “Black and white” is a collocation, “salt and pepper”, too. You’ll also find that “force”, “tool”, and “group” collocate with “powerful” (but not “strong”), while “support”, “ties”, and “relationships” are “strong” (but not “powerful”).

So the first question I have is, what gets collocated with “this” and “that”? To figure this out, I ask WordHoard to give me collocates that appear 1, 2, or 3 words after “the”, “this”, and “that (d)”. There are lots of ways of calculating strength of relationships, but for this mini-project I’m going to go for exploration more than proof. So what I do is drop out words that collocate strongly with “the” as well as “this” and/or “that”. What I care about are the words that seem to be more strongly associated with demonstratives than with the plain-ole determiner “the”. The thinking here is counterfactual: when someone uses this/that, they COULD have used “the” instead (and that is generally more common).

First, let’s look at “this”. I’m going to report things that occur with “this” 1.5 or more times as often as they occur with “the”. Note that there are 3.2 times as many “the” tokens as “this” tokens, so we’re talking about words that are REALLY combining with “this” a lot more than if the distributions were just random.
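(Here’s roughly what that filtering looks like in R; the counts below are invented, and in practice they come from WordHoard’s collocate reports.)

# Minimal sketch of the comparison above, with invented counts of how often
# each collocate shows up within 3 words after "this" and after "the".
colloc <- data.frame(word      = c("jolly", "world", "man"),
                     with.this = c(12, 40, 25),
                     with.the  = c(8, 20, 200))

# Keep words occurring with "this" at least 1.5x as often as with "the".
# Since "the" is ~3.2x more frequent overall, these are big skews.
colloc$ratio <- colloc$with.this / colloc$with.the
subset(colloc, ratio >= 1.5)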

  • You’ll immediately notice that adjectives are showing up:
    • jolly, sorry, noble, little, wide, worthy, innocent, woeful, fresh
  • And other affectively oriented words:
    • curse, dread, miracle
  • Lots of social roles:
    • maid, merchant, duke, yeoman, summoner, maiden, dame, earl, monk, carpenter, marquis, messenger
  • And really interestingly, proper names, too, which goes with the narrative style of “this {social role}”, I think:
    • Damyan, Absolon, Griselda, Cambuskan, Nicholas, Palamon, Alla, Arcite, Melibee, John, Phebus, Emily
  • Other words that might get your attention:
    • world, matter, case, canon, answer, tale, sentence, present, need, creature, conclusion, marriage, treasure

Each of these is worth tracking down and doing close readings of the particular lines. But in general, it’s looking like Chaucer will be a rich place for tracking down affective demonstratives.

What about “that”? There are fewer of these that stand out (because there’s a lot more “this” than “that” among demonstratives). But words for “that” include:

  • other, word, one, ilk

I think you’ll intuitively get why “that one” and “that ilk” occur so much more than “the/this one” and “the/this ilk”. There’s some sort of othering/distancing going on.

The Google Ngram viewer is pretty good for tracing developments back to 1800 (it has older data but the guys who work on it don’t think it gets reliable til 1800-ish). We can see that “ilk” really does have a “thatness” to this day, while “the one” is more popular in general (just not in the Canterbury Tales). Again, these graphs are NOT for Chaucer’s time but show what’s happening more recently.

Okay, so how are the demonstratives distributed?

To answer the question, “which pilgrim uses demonstratives the most?” I took demonstratives and looked at how the various tellers in the Canterbury Tales use them. We can establish a base rate of demonstrative use across Chaucer (all his works or just the Canterbury Tales). We can do this relative to all words or relative to total-tokens-of-the-plus-that-plus-this. (A token is just an occurrence of a word.)

For example, whether you’re looking at all of Chaucer or just the Canterbury Tales, that makes up about 5% of all the the/this/that’s. For any given pilgrim, we can look at their tale and count up the the/that/this. If 5% of the total are that, then nothing special is happening. Like the Knight–he has 933 the/this/that’s. So we’d guess that he’s going to have 0.05*933 ≈ 47 uses of that. In fact, he has 54. Not a big difference. (By the way, I’ll put in a table with all the counts for all the pilgrims if you’re interested. Make the request in the comments below.)
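In R, the back-of-the-envelope version of that calculation looks something like this (the base rate is the ~5% figure above; the Knight’s counts are the ones just given):

# Expected vs. observed demonstrative "that" for the Knight.
base.rate   <- 0.05   # share of the/this/that tokens that are "that" across Chaucer
knight.det  <- 933    # the Knight's total the/this/that tokens
knight.that <- 54     # his observed demonstrative "that" count

expected <- base.rate * knight.det   # about 47
c(expected = expected, observed = knight.that)

# A quick sanity check on whether 54 out of 933 is surprising at a 5% rate
binom.test(knight.that, knight.det, p = base.rate)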

Now, lots of folks like the folksiness of Sarah Palin (other people call it pseudo-folksiness). If you’re familiar with the Canterbury Tales, you may immediately wonder about the earthy Wife of Bath. By demonstrative use, the Wife of Bath is decidedly NOT the Sarah Palin of the Canterbury Tales. The Wife has 120 the/this/that, but only 4 of these are that’s (we would’ve guessed about 6). She uses this a little bit more than expected but not by much.

I would like to stress that she DOES use affective demonstratives–it’s hard not to. Here are some examples. The point is that she doesn’t use them all that often.

That gentil text kan I wel understonde (she’s talking about God telling people to multiply, line 29)

That oon thou shalt forgo, maugree thyne yen (something like “give it up, damn your eyes”, line 315)

That oon for love, that oother was for hate (line 749, oon=one and oother is just a much much better way to spell ‘other’)

Okay, so the Knight uses affective demonstratives at a normal rate, and the Wife of Bath underuses them. WHO IS THE SARAH PALIN OF THE CANTERBURY TALES? There are two contenders when we just go by word counts and stats. The first contender is the Shipman. But looking at the data qualitatively shows that most of his that’s and this’s are pretty non-affective. By contrast…the Pardoner uses this and that all over the place (over twice as many that’s as we’d expect, for example, and about 1.25 times as many this’s). For example:

Withinne that develes temple in cursed wise (470)

Were dryven for that vice, it is no drede (507)

He hath a thousand slayn this pestilence (679)

And we wol sleen this false traytour Deeth (699)

Now this is the part where I will let other people draw comparisons between the Pardoner and Sarah Palin. I will not. I will, however, refresh your memory that the story he tells is about drunk guys trying to kill Death and is generally seen as morally proper…though the Pardoner himself is seen as corrupt/corrupted/corrupting. But back to the first hand, the Pardoner really is one of the most fascinating characters in the Tales. Love her or hate her, that Sarah Palin is pretty fascinating, too.


Over 3k comments with sentiment coding

15 Nov
Just found this page (http://www.cyberemotions.eu/data.html) and thought I’d pass it along. If you go to the website, you can sign up to get access to their collection of:
  • BBC News forum posts: 2,594,745 comments from selected BBC News forums and > 1,000 human-classified sentiment strengths with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
  • Digg post comments: 1,646,153 comments on Digg posts (typically highlighting news or technology stories) and > 1,000 human-classified sentiment strengths with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
  • MySpace (social network site) comments: six sets of systematic samples (3 for the US and 3 for the UK) of all comments exchanged between pairs of friends (about 350 pairs for each UK sample and about 3,500 pairs for each US sample) from a total of >100,000 members and > 1,000 human-classified sentiment strengths with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
Here are some examples of their classifications (although I think they just give you the average for each sentence):
  • hey witch wat cha been up too (scores: +ve: 2,3,1; -ve: 2,2,2)
  • omg my son has the same b-day as you lol (scores: +ve: 4,3,1; -ve: 1,1,1)
  • HEY U HAVE TWO FRIENDS!! (scores: +ve: 2,3,2; -ve: 1,1,1)
  • What’s up with that boy Carson? (scores: +ve: 1,1,1; -ve: 3,2,1)
Here’s the annotator agreement table for the MySpace stuff.
Previous emotion-judgement/annotation tasks have obtained higher inter-coder scores, but without strength measures and therefore having fewer categories (e.g., Wiebe et al., 2005). Moreover, one previous paper noted that inter-coder agreement was higher on longer (blog) texts (Gill, Gergle, French, & Oberlander, 2008), suggesting that obtaining agreement on the short texts here would be difficult. The appropriate type of inter-coder reliability statistic for this kind of data with multiple coders and varying differences between categories is Krippendorff’s α (Artstein & Poesio, 2008; Krippendorff, 2004). Using the numerical difference in emotion score as weights, the three coder α values were 0.5743 for positive and 0.5634 for negative sentiment. These values are positive enough to indicate that there is broad agreement between the coders but not positive enough (e.g., < 0.67, although precise limits are not applicable to Krippendorff’s α with weights) to suggest that the coders are consistently measuring a clear underlying construct. Nevertheless, using the average of the coders as the gold standard still seems to be a reasonable method to get sentiment strength estimates.

 

Table 1. Level of agreement between coders for the 1,041 evaluation comments (exact agreement, % of agreements within one class, mean percentage error, and Pearson correlation).

Comparison    | +ve   | +ve +/- 1 class | +ve mean % diff. | +ve corr | -ve   | -ve +/- 1 class | -ve mean % diff. | -ve corr
Coder 1 vs. 2 | 51.0% | 94.3%           | .256             | .564     | 67.3% | 94.2%           | .208             | .643
Coder 1 vs. 3 | 55.7% | 97.8%           | .216             | .677     | 76.3% | 95.8%           | .149             | .664
Coder 2 vs. 3 | 61.4% | 95.2%           | .199             | .682     | 68.2% | 93.6%           | .206             | .639

Now, I’m not a huge fan of objective/subjective distinctions (see, for example, my review of computational linguistics stuff on emotion). But positive/negative and intensity do seem to be real, if incomplete, dimensions, and this might be a useful set of data.
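(If you end up hand-coding your own data like this and want the same kind of agreement number, here’s a minimal sketch using kripp.alpha from the irr package. The score matrix is invented, and “interval” weighting is just one way to approximate the distance-weighted α they report.)

library(irr)   # assumes the irr package is installed

# Hypothetical positive-strength scores (1-5) from three coders on six comments;
# rows are coders, columns are comments.
scores <- matrix(c(2, 3, 1, 4, 2, 1,
                   2, 3, 2, 3, 3, 1,
                   1, 2, 2, 3, 2, 1),
                 nrow = 3, byrow = TRUE)

# Interval weighting penalizes disagreements by their numerical distance
kripp.alpha(scores, method = "interval")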

Dealing with knockouts in R (ditching Goldvarb)

14 Nov

Once you’ve got your corpora all situated and annotated, the next part is analysis.

Robin Melnick offers this guest blog post for folks learning R–especially if you’re a sociolinguist moving off of Goldvarb (if this is the case, check out Daniel Johnson’s website for other resources: http://danielezrajohnson.com/index.html).

Heeeere’s Robin:

———

I wrote up these instructions for a sociolinguistics colleague at another institution who’s in the throes of moving her life from Goldvarb to R. Pretty straightforward stuff for you R veterans but possibly useful for anyone less experienced in regression analysis.

One nice feature of Goldvarb is that it automatically identifies “knockouts,” i.e., empty cells. R, on the other hand, lets you proceed with regression even when you have such an invariant factor. This typically results in a fixed effect with a falsely huge error and estimated beta coefficient, so you really want to remove these before you fit your model.

The good news is that it’s straightforward and easily done.

I’ll illustrate with some of John Rickford’s Bajan (Barbadian Creole) data, in particular where we were looking at question formation and the factors constraining (predicting) inversion.

My dataframe is ‘ba’.

My dependent variable is inversion (whether or not the given question token is inverted); it lives in the dataframe as the column ‘var’, with values ‘y’ and ‘n’.

For each predictor (independent variable) I want to see if there are any structural zeros. To do this we generate the table that Goldvarb does for you automatically. Let’s illustrate with auxiliary type, ‘aux’.

 > table(ba$aux,ba$var)
       y   n
  b   0  11
  d   9 623
  g   0   5
  l  27  87
  m   4  90
  x   0   5
  z   0  60

Just like Goldvarb we visually inspect the table to see which aux types we want to remove/keep. We have four “knockouts” (here, all in favor of non-inversion), leaving three with variation: ‘d’, ‘l’, and ‘m’ (these represent do-support, copula be, and modals). To keep only tokens with these:

 > ba = ba[ba$aux%in%c('d','l','m'),]

This says to replace ba with itself, keeping just those rows for which aux is among the three factor levels we want. To see that it worked, let’s look at the table again:

> table(ba$aux,ba$var)
      y   n
  b   0   0
  d   9 623
  g   0   0
  l  27  87
  m   4  90
  x   0   0
  z   0   0

We can see that tokens corresponding to the knockout levels have been removed. There is, however, one last step: In R, we need to actually tell it to entirely remove the levels from the factor, not just the corresponding tokens. We can now do that with:

> ba$aux = ba$aux[drop=T]

This tells R to remove from the factor any levels for which there are no corresponding tokens left. A final view of the table confirms we have what we want:

> table(ba$aux,ba$var)
      y   n
  d   9 623
  l  27  87
  m   4  90

Repeat this procedure for each of your factors. The approach above closely parallels what you do with Goldvarb – visual inspection, then manual encoding of what to remove. We can, however, write R code that does all of the above automatically:

> aux.table = table(ba$aux,ba$var)
> aux.keep  = aux.table[(aux.table[,1]>0) & (aux.table[,2]>0),]
> ba        = ba[ba$aux%in%rownames(aux.keep),]
> ba$aux    = ba$aux[drop=T]

The first line generates the table as before, but stores it (as aux.table). The second line creates a second table (aux.keep) from the first, only keeping those rows where both columns are non-zero (i.e., not a knockout). Now the names of the rows in this reduced table will be the names of the levels of the factor group that we want to keep. The third line now keeps only those rows in the dataframe for which aux is among the names of the rows in our little table (rownames(aux.keep)). The fourth line is as before where we then want to fully remove the levels for which no tokens remain. So you can see that this does exactly what we did before, just without having to do the visual inspection of the table. Then you’d still repeat this for each IV.

———————–

Any questions? Add a comment or write to Robin directly at rmelnick at stanford (you know the rest).

Topic modeling. Also sex.

11 Nov

Earlier today I mentioned some fun stuff available at Infochimps, including a corpus of 4,700+ erotic stories. I don’t know *who* is clicking on that link, but in the spirit of “give your readers what they want”, I’m going to give you a sense of that corpus (you can skip down below–but WAIT!).

One of my favorite tools out there is the Topic Modeling Toolkit that the Stanford NLP Group has made available (Daniel Ramage and Evan Rosen, in particular).

Let’s say you have a set of texts (your corpus). This could be a set of Shakespearean plays or it could be a bunch of tweets or it could be answers to linguistic ideology surveys (“What does this person sound like?”).

You probably want to understand how these texts are similar and different from one another. If lexical clues are interesting to you (because they are host to particular sounds or meanings or relate to syntax, etc), then you might want to try topic modeling.

You really don’t have to be a programmer to use the Topic Modeling Toolkit (TMT)–the instructions are very clear: http://nlp.stanford.edu/software/tmt/tmt-0.3/. Unlike most forms of clustering, topic modeling lets you have words appear in more than one place and it’s more scientific and systematic than word clouds you might be tempted to use.

I’ve been using the TMT for a number of different projects. I’m not going to talk about them. Instead, a brand-new *exclusive to this blog* mini-project on erotica. I should also note that in my “real” work, I find “labeled LDA” most useful. What I show below is plain LDA, which looks at statistics to see what clusters together rather than looking at what clusters relative to some meta-data tag you might have. “Labeled LDA” is easy to do, too (the last section on this page).

Step 1: Get your corpus

In my case, I grabbed the following file that I found on Infochimps: http://www.infochimps.com/datasets/corpus-of-erotica-stories

Step 2: Clean it up

The nice thing about corpora you get from the LDC is that they’re pretty tidy. Not so with web-based stuff. But the corpus in question had extra tabs all over the place. Here’s a simple UNIX command to clean it up. (If you have a Windows machine, you might try getting Cygwin so you can do UNIX-y things on your PC.)

sed "s/\t/ /g" erotica_text.tsv > erotica_text_cleaned.tsv

Basically what this little code is doing is searching (s) for tabs (/\t/) and replacing them with spaces (/ /) everywhere (g) in a particular file (erotica_text.tsv). Then you say that you want the output (>) to go to a new file (erotica_text_cleaned.tsv). Here’s more on sed and tabs.

To make the TMT work out-of-the-box, I went ahead and added a column with a unique identifier to each line (each line=a different text/story). I did that in Excel because I think Excel is pretty handy.
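(If you’d rather skip Excel, a couple of lines of R do the same job; the output file name here is my own invention.)

# Minimal sketch: prepend a unique id to each line (= each story) of the
# cleaned file, giving the id-then-text layout I point the TMT script at below.
stories <- readLines("erotica_text_cleaned.tsv")
write.table(data.frame(id = seq_along(stories), text = stories),
            "erotica_text_with_ids.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)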

Step 3: Run the TMT

Alright, strictly speaking, you need to install the TMT, copy the scala scripts and *slightly* edit them–that’s all covered here very well, so I’m not going to repeat it.

In my case, I ran an edited version of the example-1 script that Dan and Evan provide. This is enough data that my laptop didn’t really want to handle it, so I used the NLP machines. The key is editing the script to point to the ID column (1) and the text column (2 for me). I ran an edited version of their example-5 script so that I could figure out the best number of topics to have–the idea is that the more topics you add, the lower “perplexity score” you get, but at some point, you don’t gain that much from adding more topics. My drop-off point was at 15 topics. Next, I ran a version of their example-2 script (switching from the default 30 topics to 15).
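The drop-off is easiest to see on a plot of perplexity against the number of topics; here’s a trivial sketch (the numbers are invented; in practice you’d collect them from the example-5 runs):

# Minimal sketch: plot perplexity against number of topics and look for the
# point where adding topics stops buying you much. These numbers are invented.
k          <- c(5, 10, 15, 20, 25, 30)
perplexity <- c(2400, 2100, 1950, 1930, 1920, 1915)
plot(k, perplexity, type = "b",
     xlab = "number of topics", ylab = "perplexity")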

Through all of this I dropped the 40 most frequent terms–that’s standard practice in computational linguistics, though you want to think about it carefully. For example, in my research on Twitter emoticons, if I had done this I would’ve dropped “lol” and “happy”, and for a study about what emoticons co-occur with, that would’ve been silly. Since the present study is just exploration, I really just want to knock off frequent words so they don’t dominate and repeat across all the topics.

Step 4: Analyze

The TMT runs a bunch of iterations (by default, 1000). I grab the summary.txt file from the “01000” folder. What shows up?
  • Topic 00: S&M
    • will, mistress, slave, leather, more, tied, master, behind, around, feet, ass, again, pain, head, are, left, room, gag, very, pulled
    • So a lot of these are social roles and tools of the trade. There’s also spatial stuff (the “where” is important–room, behind, I’m not sure about left since I didn’t part-of-speech tag the corpus).
    • Will is almost certainly mostly about the future tense, though it is kind of fun that in a bondage scene you see it so much (I will break your will…). Notice also the use of more, very, and again keywords that may be worth pursuing.
  • Topic 01: Swingers
    • john, linda, susan, bill, wife, debbie, our, their, tom, paul, jeff, very, told, other, fuck, after, two, jack, janet, kathy
    • The names are all well and good (someone want to grab this for the American Name Society?). I’m guessing–without really looking at the data–that there’s some sort of marriage thing happening here with wife as well as our/their; perhaps two, after, and other fit this as well. I’m curious about why told is there so much.
  • Topic 02: Body parts
    • I’m pretty sure that if I list all the words in this topic, the whole blog will be shoved into some restricted, spammy part of the Internet. So let me know if you want the words–I’ll send them to you. In addition to 13 body parts and a word that rhymes with sum, there’s began, started, pulled, off, around, lisa, hard.
  • Topic 03: G-rated body parts
    • face, head, body, around, man, off, eyes, against, again, little, legs, hands, away, there, feet, pain, hair, began, arms, through
    • Prepositions are pretty interesting when it comes to sex, aren’t they. I’m also curious about little. And all of this beginning/starting stuff.
  • Topic 04: Classroom hijinks
    • bottom, jennifer, been, mrs, school, girl, miss, girls, spanking, desk, sharon, again, more, very, before, class, still, two, after, will
    • The word I’m most interested in here is still.
    • Notice, btw, that spanking is not part of the S&M-y Topic 00 above.
  • Topic 05: ???
    • their, been, who, are, will, very, there, some, which, more, our, other, only, sex, any, first, than, after, most, even
    • Not sure what to make of this. Thoughts?
  • Topic 06: Family stuff
    • None of the words in this topic are “naughty”, however, once I tell you that they are occurring as a topic in an analysis of erotic stories, they are likely to induce a “whoa” reaction. Send me an email if you want the list.
  • Topic 07: Describing the acts
    • Here the words are body parts, actions, and evaluations (hot, hard, big).
  • Topic 08: Other names
    • mary, jim, sue, dave, jane, carol, beth, ann, sandy, cathy, donna, brad…also various names for female body parts, as well as hot, began, their, while, pulled
  • Topic 09: Star Trek!
    • been, alex, captain, their, janice, there, are, peter, beverly, looked, before, more, did, will, himself, well, eyes, only, deanna, off
    • Where eyes are appearing and what they are doing is probably worth pursuing.
    • Judging from the titles of the stories in the corpus, there are a number that put Commander Will Riker and/or Captain Jean-Luc Picard with Dr. Beverly Crusher, and I recall one title putting Counselor Deanna Troi with Wesley Crusher (played by Wil Wheaton).
    • I wonder about only.
  • Topic 10: Romance novel type descriptions of acts
    • Body parts, prepositions, and adverbs that are fairly tame (e.g., slowly).
  • Topic 11: Exposition
    • there, got, started, off, get, didn’t, went, after, some, around, told, did, came, really, looked, more, took, our, see, before
    • I think these may be about setting the scene before the action takes place?
  • Topic 12: Women and what they wear
    • Various clothing items and verbs that go with them (looked, look, wearing, took).
  • Topic 13: Are you exhausted yet? Some mini-project this is…
    • This is a sexual topic, but nothing too extreme.
  • Topic 14: ???
    • don’t, i’m, want, know, get, think, can, didn’t, how, it’s, are, right, going, well, really, you’re, good, there, yes, did
    • As with topic 05, I’m not quite sure what to do with these, though I think these are a little more coherent as a category–they seem very discoursey to me.
The main point here is that you should consider topic modeling as a way of exploring your data. I wonder if I succeeded in that.
In terms of the erotic topics, maybe I can claim that we’re showing how desires are constructed–what goes with what. I think it also gives you a sense of “keywords” worth further pursuit. For example, this is probably not the right corpus to use to inquire about the nature of “liberty” or dative alternations. But what about (i) Prepositions of Desire, (ii) reported and reporting speech, (iii) discourse markers, and (iv) names-and-sex (see also Arnold Zwicky’s “How to name a porn star” and Amy Perfors’ HotOrNot experiment with front/back vowels)?

Monkeying around: Infochimps

11 Nov
It’s Friday, so it’s a great day to go explore Infochimps. Infochimps is kind of like a data clearinghouse. It has all sorts of stuff. For example:

Some of you eat with finger bowls and extended pinkies–does Infochimps have less smutty stuff?

Finally, some stuff on accents:

  • Infochimps HAS lots of data and they also LINK to lots of data. For example, the Speech Accent Archive, which has over 1,200 people reading a paragraph in English. They’re from all over the place (native and non-native English speakers).
  • There’s also a meme on YouTube that you could use. Search for “What’s my accent?” (Here are some steps I wrote up about how to download YouTube audio so you can play with it in Praat).
    • And, okay, the whole reason you should’ve been reading this post is to get the following link to “Pronunciation Manual“. If you haven’t seen it yet: it’s honey-badger good. Happy Friday!

Tgrep2 alternatives

7 Nov

If you’re a user of tgrep2, you may have noticed that it hasn’t been updated since 2005. That’s because the main guy behind it, Doug Rohde, has moved on to other things (outside of academia). The good news is that tgrep2 is holding up to the tests of time pretty well (see this intro post if you’re new to tgrep2). But usually when things aren’t being actively developed, they fall into disrepair. So what are the alternatives?

Tregex

The chief alternative is Tregex, which is a home-grown Stanford tool.

The good news:

  • Has a GUI, which makes exploring trees a lot better.
  • This is actually more than just pretty trees, though. It works fine with different languages–like Chinese and Arabic. So you can type your queries in without needing some sort of transcription.
  • Extra operators like “restricted dominance” (I want something that dominates something else through a particular set of categories)
  • Tgrep2 is only Unix, but Tregex is cross-platform (because it uses Java).

The mixed news:

  • Tregex doesn’t pre-index, so it’s doing a grep each time you search. Tgrep2, by contrast, makes you pre-index your corpus; if someone hasn’t already done that, you have to figure it out and spend the time, but the indexing makes all your subsequent searches faster.
  • So if you just have trees in a file, you’re ready to go with tregex. If you’ve got a big corpus, you’re probably going to be frustrated by speed.

The bad news:

  • Tgrep2 lets you do macros (“these are control verbs”). That’s not a current feature in Tregex.

TigerSearch

So I was getting ready to tell you about TigerSearch and sum it up by “you have to specify even more of the syntax than you do in Tgrep2, so that seems too bulky”, but the point is moot because TigerSearch isn’t maintained any more: http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/. (Note that TIGERSearch searches XML markup rather than Penn Treebank style.)

Know any other alternatives to Tgrep2?

Emotion corpora

6 Nov

One of the common ways that phoneticians and other researchers have looked at emotion-in-language is by studying acted affect. That is, you get a bunch of people to read number lists or the alphabet in “angry” voice, “happy” voice, etc. Then you see if other people can reliably guess the emotion and then you go and look for the acoustic correlates.

If you’re interested in this sort of thing, you could try the Emotional Prosody Speech and Transcripts corpus (if you’re at Stanford and you’ve gotten corpus access, you’ll find it at /afs/ir/data/linguistic-data/EmotionalProsodySpeechAndTranscripts).

Now, there are a number of known issues with acted data–chiefly that it is stereotyped in particular ways. And if you wanted to detect what’s going on in a call center, “angry actors” wouldn’t help you nearly as much as “actual callers who are annoyed/disappointed/etc”. If you’re curious about more naturalistic corpora/research, here are some resources you might find useful (they’re all on my web page about emotions and language: http://www.stanford.edu/~tylers/emotions.shtml).

11/7/2011 post-script: If acted data suits your needs, you can also consider something other than English–for example, the Mandarin Affective Speech corpus will get you Chinese.

October corpora from LDC

3 Nov

(This is mostly for Stanford folks)

We get periodic shipments of new corpora from the LDC. These are always available for you to check out as DVDs (just follow steps for access here). We can also put these online so you can ssh into the Stanford servers and go to /afs/ir/data/linguistic-data.

But there’s a catch. We have a limited amount of space there–so to add something, we need to remove something. If any of these corpora–or any other corpora you know about–would be great to have online, send me a note.

Spanish Gigaword–third edition

The great thing about this corpus is that it is enormous. Depending upon your research project, you may or may not be as psyched about it being newswire text. It’s got everything the previous editions had, plus newer stuff–so it covers, roughly, the mid-1990’s til Dec 31, 2010.

Arabic Gigaword–fifth edition

Same basic deal as the Spanish Gigaword–it covers news in Arabic from 2000 until Dec 2010. Here’s what the marked-up content looks like: http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2011T11.jpg.

2008 NIST Speaker Recognition Evaluation Test Set

This is actually nine DVDs worth of data because it’s 942 hours of telephone speech and interviews. The telephone speech is multilingual–predominately English but bilinguals were recruited, so in the telephone conversations you also get  Arabic, Bengali, Chinese, Egyptian Arabic, Farsi, Hindi, Italian, Japanese, Korean, Lao, Punjabi, Russian, Tagalog, Tamil, Thai, Urdu, Uzbek, Vietnamese, Wu Chinese, Yue Chinese. The interviews are just English.

You do get the transcripts, btw. The corpus was imagined to be for speech recognition, but there may be some really interesting code-switching stuff for people interested in bilingual data.