Topic modeling. Also sex.

11 Nov

Earlier today I mentioned some fun stuff available at Infochimps, including a corpus of 4,700+ erotic stories. I don’t know *who* is clicking on that link, but in the spirit of “give your readers what they want”, I’m going to give you a sense of that corpus (you can skip down below–but WAIT!).

One of my favorite tools out there is the Topic Modeling Toolkit that the Stanford NLP Group has made available (Daniel Ramage and Evan Rosen, in particular).

Let’s say you have a set of texts (your corpus). This could be a set of Shakespearean plays or it could be a bunch of tweets or it could be answers to linguistic ideology surveys (“What does this person sound like?”).

You probably want to understand how these texts are similar and different from one another. If lexical clues are interesting to you (because they are host to particular sounds or meanings or relate to syntax, etc), then you might want to try topic modeling.

You really don’t have to be a programmer to use the Topic Modeling Toolkit (TMT)–the instructions are very clear: http://nlp.stanford.edu/software/tmt/tmt-0.3/. Unlike most forms of clustering, topic modeling lets a word belong to more than one topic, and it’s more systematic than the word clouds you might be tempted to use.

I’ve been using the TMT for a number of different projects. I’m not going to talk about them. Instead, a brand-new *exclusive to this blog* mini-project on erotica. I should also note that in my “real” work, I find “labeled LDA” most useful. What I show below is plain LDA, which looks only at word co-occurrence statistics to see what clusters together, rather than at what clusters relative to some metadata tag you might have. “Labeled LDA” is easy to do, too (the last section on this page).

Step 1: Get your corpus

In my case, I grabbed the following file that I found on Infochimps: http://www.infochimps.com/datasets/corpus-of-erotica-stories

Step 2: Clean it up

The nice thing about corpora you get from the LDC is that they’re pretty tidy. Not so with web-based stuff: the corpus in question had extra tabs all over the place. Here is a simple UNIX command to clean it up. (If you have a Windows machine, you might try getting Cygwin so you can do UNIX-y things on your PC.)

sed "s/\t/ /g" erotica_text.tsv > erotica_text_cleaned.tsv

Basically, what this little bit of code does is substitute (s) a space (/ /) for every tab (/\t/) everywhere (g) in a particular file (erotica_text.tsv). Then you say that you want the output (>) to go to a new file (erotica_text_cleaned.tsv). Here’s more on sed and tabs.
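One portability caveat: treating \t as a tab inside a sed pattern is a GNU extension, and other seds (the BSD one that ships with macOS, for instance) may not read it that way. Here is a minimal sketch of the same cleanup using tr, which understands \t on both. The function name is made up for this example; the file names are the ones used above.

```shell
# Replace every tab with a single space, like the sed one-liner above.
# tr interprets '\t' as a tab on both GNU and BSD systems.
# tabs_to_spaces is a hypothetical helper name, not part of any toolkit.
tabs_to_spaces() {
  tr '\t' ' ' < "$1" > "$2"
}

# Usage:
#   tabs_to_spaces erotica_text.tsv erotica_text_cleaned.tsv
```

Either route should give you the same cleaned file; use whichever your system supports.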

To make the TMT work out-of-the-box, I went ahead and added a column with a unique identifier to each line (each line=a different text/story). I did that in Excel because I think Excel is pretty handy.
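If you would rather stay on the command line than open Excel, the same ID column can be sketched with awk. This assumes one story per line, as in the cleaned file; the function name and the output file name are invented for this example.

```shell
# Prefix each line with its line number and a tab: "ID<TAB>text".
# That puts the ID in column 1 and the story text in column 2.
# add_ids and the output file name below are hypothetical.
add_ids() {
  awk '{ printf "%d\t%s\n", NR, $0 }' "$1" > "$2"
}

# Usage:
#   add_ids erotica_text_cleaned.tsv erotica_text_ids.tsv
```

Line numbers are unique by construction, which is all the ID column needs to be.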

Step 3: Run the TMT

Alright, strictly speaking, you need to install the TMT, copy the scala scripts and *slightly* edit them–that’s all covered here very well, so I’m not going to repeat it.

In my case, I ran an edited version of the example-1 script that Dan and Evan provide. This is enough data that my laptop didn’t really want to handle it, so I used the NLP machines. The key is editing the script to point to the ID column (1) and the text column (2 for me). Then I ran an edited version of their example-5 script so that I could figure out the best number of topics to have–the idea is that the more topics you add, the lower the “perplexity score” you get, but at some point you don’t gain much from adding more topics. My drop-off point was at 15 topics. Finally, I ran a version of their example-2 script (switching from the default 30 topics to 15).
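I picked the drop-off point by eye, but here is one rough way to make it concrete. This sketch assumes you have copied the perplexity scores by hand into a plain text file with two columns (number of topics, then perplexity), sorted by number of topics. That file format, the function name, and the 2% threshold are all assumptions for illustration, not anything the TMT produces or recommends.

```shell
# Print the number of topics after which adding more topics improves
# perplexity by less than a chosen fraction.
#   $1 = file with "num_topics perplexity" per line, sorted by num_topics
#   $2 = smallest relative improvement still worth having (default 0.02)
# pick_topics is a hypothetical helper; the threshold is arbitrary.
pick_topics() {
  awk -v min="${2:-0.02}" '
    NR > 1 && (prev_p - $2) / prev_p < min { print prev_k; found = 1; exit }
    { prev_k = $1; prev_p = $2 }
    END { if (!found) print prev_k }   # never flattened: report the last row
  ' "$1"
}

# Usage:
#   pick_topics perplexity.txt 0.02
```

Treat the answer as a starting point for a judgment call, not a replacement for looking at the curve and the topics themselves.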

Through all of this I dropped the 40 most frequent terms–that’s standard practice in computational linguistics, though you want to think about it carefully. For example, in my research on Twitter emoticons, if I had done this I would’ve dropped “lol” and “happy”, and for a study about what emoticons co-occur with, that would’ve been silly. Since the present study is just exploration, I really just want to knock off frequent words so they don’t dominate and repeat across all the topics.
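Before dropping anything, it helps to look at what the most frequent terms actually are. The TMT handles the dropping itself inside the script; the sketch below is only a quick way to preview a frequency list from the command line. Its tokenization is deliberately crude (lowercase, split on anything that isn’t a letter, so “don’t” becomes “don” and “t”) and will not match the TMT’s exactly. The function name is made up.

```shell
# List the N most frequent word forms in a text file, most frequent first.
#   $1 = text file, $2 = how many terms to list (default 40)
# top_terms is a hypothetical helper with a crude tokenizer:
# lowercase everything, then split on any run of non-letters.
top_terms() {
  tr '[:upper:]' '[:lower:]' < "$1" |
    tr -cs '[:alpha:]' '\n' |
    sed '/^$/d' |
    sort | uniq -c | sort -rn |
    head -n "${2:-40}" |
    awk '{ print $2 }'
}

# Usage:
#   top_terms erotica_text_cleaned.tsv 40
```

If words you care about show up in that list (as “lol” and “happy” would have for me), that’s your cue to drop fewer terms or to use a hand-picked stop list instead.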

Step 4: Analyze

The TMT runs a bunch of iterations (by default, 1000). I grab the summary.txt file from the “01000” folder. What shows up?
  • Topic 00: S&M
    • will, mistress, slave, leather, more, tied, master, behind, around, feet, ass, again, pain, head, are, left, room, gag, very, pulled
    • So a lot of these are social roles and tools of the trade. There’s also spatial stuff (the “where” is important–room, behind; I’m not sure about left, since I didn’t part-of-speech tag the corpus).
    • Will is almost certainly mostly about the future tense, though it is kind of fun that in a bondage scene you see it so much (I will break your will…). Notice also the use of more, very, and again: keywords that may be worth pursuing.
  • Topic 01: Swingers
    • john, linda, susan, bill, wife, debbie, our, their, tom, paul, jeff, very, told, other, fuck, after, two, jack, janet, kathy
    • The names are all well and good (someone want to grab this for the American Name Society?). I’m guessing–without really looking at the data–that there’s some sort of marriage thing happening here, with wife as well as our/their; perhaps two, after, and other fit this as well. I’m curious about why told is there so much.
  • Topic 02: Body parts
    • I’m pretty sure that if I list all the words in this topic, the whole blog will be shoved into some restricted, spammy part of the Internet. So let me know if you want the words–I’ll send them to you. In addition to 13 body parts and a word that rhymes with sum, there’s began, started, pulled, off, around, lisa, hard.
  • Topic 03: G-rated body parts
    • face, head, body, around, man, off, eyes, against, again, little, legs, hands, away, there, feet, pain, hair, began, arms, through
    • Prepositions are pretty interesting when it comes to sex, aren’t they? I’m also curious about little. And all of this beginning/starting stuff.
  • Topic 04: Classroom hijinks
    • bottom, jennifer, been, mrs, school, girl, miss, girls, spanking, desk, sharon, again, more, very, before, class, still, two, after, will
    • The word I’m most interested in here is still.
    • Notice, btw, that spanking is not part of the S&M-y Topic 00 above.
  • Topic 05: ???
    • their, been, who, are, will, very, there, some, which, more, our, other, only, sex, any, first, than, after, most, even
    • Not sure what to make of this. Thoughts?
  • Topic 06: Family stuff
    • None of the words in this topic are “naughty”, however, once I tell you that they are occurring as a topic in an analysis of erotic stories, they are likely to induce a “whoa” reaction. Send me an email if you want the list.
  • Topic 07: Describing the acts
    • Here the words are body parts, actions, and evaluations (hot, hard, big).
  • Topic 08: Other names
    • mary, jim, sue, dave, jane, carol, beth, ann, sandy, cathy, donna, brad…also various names for female body parts, as well as hot, began, their, while, pulled
  • Topic 09: Star Trek!
    • been, alex, captain, their, janice, there, are, peter, beverly, looked, before, more, did, will, himself, well, eyes, only, deanna, off
    • Where eyes are appearing and what they are doing is probably worth pursuing.
    • Judging from the titles of the stories in the corpus, there are a number that put Commander Will Riker and/or Captain Jean-Luc Picard with Dr. Beverly Crusher, and I recall one title putting Counselor Deanna Troi with Wesley Crusher (played by Wil Wheaton).
    • I wonder about only.
  • Topic 10: Romance novel type descriptions of acts
    • Body parts, prepositions, and adverbs that are fairly tame (e.g., slowly).
  • Topic 11: Exposition
    • there, got, started, off, get, didn’t, went, after, some, around, told, did, came, really, looked, more, took, our, see, before
    • I think these may be about setting the scene before the action takes place?
  • Topic 12: Women and what they wear
    • Various clothing items and verbs that go with them (looked, look, wearing, took).
  • Topic 13: Are you exhausted yet? Some mini-project this is…
    • This is a sexual topic, but nothing too extreme.
  • Topic 14: ???
    • don’t, i’m, want, know, get, think, can, didn’t, how, it’s, are, right, going, well, really, you’re, good, there, yes, did
    • As with topic 05, I’m not quite sure what to do with these, though I think these are a little more coherent as a category–they seem very discoursey to me.
The main point here is that you should consider topic modeling as a way of exploring your data. I wonder if I succeeded in that.
In terms of the erotic topics, maybe I can claim that we’re showing how desires are constructed–what goes with what. I think it also gives you a sense of “keywords” worth further pursuit. For example, this is probably not the right corpus to use to inquire about the nature of “liberty” or dative alternations. But what about (i) Prepositions of Desire, (ii) reported and reporting speech, (iii) discourse markers, and (iv) names-and-sex (see also Arnold Zwicky’s “How to name a porn star” and Amy Perfors’ HotOrNot experiment with front/back vowels)?
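If you do want to chase one of those keywords (still, eyes, only, told), a natural next step is to look at it in context. Here is a minimal keyword-in-context sketch in awk. It assumes one story per line and whitespace-separated words; the function name and the default window of three words on each side are arbitrary choices for illustration.

```shell
# Print each occurrence of a keyword with a few words of context.
#   $1 = keyword in lowercase, $2 = text file,
#   $3 = words of context on each side (default 3)
# kwic is a hypothetical helper; matching ignores case and punctuation.
kwic() {
  awk -v kw="$1" -v n="${3:-3}" '
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)            # strip punctuation before comparing
        if (w == kw) {
          first = (i - n < 1) ? 1 : i - n
          last = (i + n > NF) ? NF : i + n
          out = ""
          for (j = first; j <= last; j++) out = out (j > first ? " " : "") $j
          print out
        }
      }
    }' "$2"
}

# Usage:
#   kwic still erotica_text_cleaned.tsv 5
```

It won’t tell you what the word is doing, but it puts the examples in front of you so you can start sorting them.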