Archive | October, 2012

R R R: The statistical pirate’s life for me

30 Oct

There are a lot of resources to help you learn statistics and how to do it with R. Here’s one taught by a guy I know and respect, Roger Peng:

https://www.coursera.org/course/compdata

It’s about 3-5 hours/week for 4 weeks. Go sign yourself up, and cruise around Coursera to see what else you might want to learn.

(Update: It’s not totally clear when it’ll be offered next. Check out Roger and pals’ Simply Statistics blog entry about a “simplified” version: http://simplystatistics.org/post/34563838584/computing-for-data-analysis-simply-statistics-edition.)

Sociolinguistic summary: news from NWAV 41

30 Oct

To get Linguists to dance, play songs that are not in English. #macarena #NWAV41 http://t.co/tloJETYh

One parting shot from Dennis Preston: “it’s not hard to teach adjectives to Texans. Just ask if they can put ‘-ass’ after it.” #NWAV41

New Ways of Analyzing Variation 41 (NWAV) just wrapped up on Sunday. It attracted 852 tweets from 58 different Twitter users. This post offers a quick analysis of the tweets (you can search on Twitter for the time being; here’s a basic spreadsheet I made for archive purposes). It also links to various corpora that were mentioned at the conference.

Want to know what got tweeted? Here’s a word cloud of everything tagged #nwav41 (excluding URLs):

Of course, this doesn’t really reflect what’s going on around NWAV, just what was tweeted about. And that’s highly skewed since 80% of the tweets came from 7 users:

Here are the talks I had more than 12 tweets about:

The presentation I gave was based on Twitter data from over 9.2m tweets, over 14k users:

  • Tyler Schnoebelen, David Bamman, & Jacob Eisenstein on Gender, styles, and social networks in Twitter.

Some corpora that got mentioned in various talks:

How do doctors and patients talk?

25 Oct

Norma Mendoza-Denton had a “special guest star” at her NWAV41 keynote: Ashley Hesson, who has been working on how doctors and patients talk (find Ashley’s website here, but I prefer this pic because it has a labcoat; here’s a link to joint work that was in Penn Working Papers).

Hesson used a Verilogue corpus. Here’s a citation and a link:

  • Kozloff and Barnett (2007) have a corpus of medical consultations (doctors advising patients about diabetes, coronary artery disease, etc).
  • http://www.verilogue.com/

Python classes for free!

25 Oct

Sign up to get started with Python–its Natural Language Toolkit (NLTK) makes it easy to answer lots of linguistic questions.

http://t.co/nU912Ew2

“Strength” in Presidential debates

23 Oct

Word clouds are pretty. Here’s what it looks like across presidential and vice-presidential debates from the first Kennedy-Nixon to the third Obama-Romney.

Frequency is kind of like the old grey mare of corpus linguistics–you don’t want to ride it too hard. Let’s try trotting out something just one notch more sophisticated. In this post, I’m going to try to answer Joshua Benton’s Twitter question from last night: how have strength and weakness been used in presidential debates?

In terms of absolute frequency, George W. Bush uses words related to strength more than any other presidential or vice-presidential candidate–here I’m combining strong/stronger/strongest/strength/strengthen/strengthened. GW has 76 uses. The next closest is Jimmy Carter with 69 uses.

A different way to look at the data is to say: let’s add up all of GW’s words and see if that 76 is a lot compared to everyone else. I couldn’t find a real corpus of debate transcripts, so I made one myself using raw data from here and here. I cleaned it up as much as I could in the night:

When we look at word counts for all the candidates, moderators, questioners, etc, we see that GW has 7.07% of all the words spoken. There are 752 uses of strong/strength/etc. So if everything were random, we’d expect him to use these words 7.07% × 752 ≈ 53 times. So he *is* using them a lot (about 1.4 times what we’d expect). But there are folks who use them even more. The biggest users (by “observed/expected”) are:

  • Mondale: 61 uses (14 expected)
  • Kennedy: 62 uses (24 expected)
  • Dukakis: 43 uses (19 expected)
  • Carter: 69 uses (32 expected)
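The “observed/expected” arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the numbers are the ones quoted in this post, not a rerun over the corpus.

```python
def expected_uses(word_share, total_uses):
    """Uses we'd expect if tokens were spread evenly by share of all words."""
    return word_share * total_uses


def observed_over_expected(observed, expected):
    """Ratio above 1 means over-use; below 1 means avoidance."""
    return observed / expected


# George W. Bush: 7.07% of all words, 752 strength-type tokens overall.
gw_expected = expected_uses(0.0707, 752)
gw_ratio = observed_over_expected(76, gw_expected)

# (observed, expected) pairs exactly as reported in the list above.
reported = {
    "Mondale": (61, 14),
    "Kennedy": (62, 24),
    "Dukakis": (43, 19),
    "Carter": (69, 32),
}

# Rank speakers from most to least over-use.
ranked = sorted(
    reported,
    key=lambda name: observed_over_expected(*reported[name]),
    reverse=True,
)

print(f"GW: expected {gw_expected:.0f}, ratio {gw_ratio:.1f}")
for name in ranked:
    obs, exp = reported[name]
    print(f"{name}: {observed_over_expected(obs, exp):.1f}x expected")
```

Ranking by the ratio rather than by raw counts is what puts Mondale ahead of Carter even though Carter has more uses.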

The big avoiders–who use less than half what we’d expect given their overall word count–are (in order of most constrained to least constrained):

  • Lehrer (he’s been part of 12 debates, so he’s got a lot of words–I’m only including people who have at least 1% of all the words in the corpus)
  • Biden
  • McCain
  • Palin
  • Ryan
  • Reagan

You might be wondering about the role of weak/weakness/etc. Well, there are far fewer uses of any of these words (only 75 total for everyone). For what it’s worth, these are the main users: Kemp, Ryan, Carter, Romney, Obama, and Perot. But I’d be careful with this since the counts are so low for all of them (Carter has the most, with 14 tokens–Ryan only uses it 3 times…so do you really want to read that much into his use?).

What about over time? Using the same kind of “observed/expected” logic but for years rather than speakers, we find that the big boom years of strength/etc were 1960, 1976, and 1984. The current year has about 60% of what we’d expect if everything were distributed at random (i.e., “observed/expected”). This has been a pretty big year for weak/etc, though (18 uses in reality instead of the 9 we’d expect–but again, much smaller counts).

Now for a few other odds and ends. First, word counts:

  • Obama has been in six debates. Mostly he has the same number of words as his rival.
  • He had about 1.9% more words than McCain across their three debates.
  • He had a 2.6% deficit relative to Romney.
  • The first Obama-Romney debate came out at 7005 words (Obama) to 7742 (Romney). Actually, Obama had–proportionately–fewer words in the third debate: 7493 vs. Romney’s 8553. (So I wouldn’t jump to any “fewer words correspond to lousy performance” conclusion.)
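To make “proportionately” concrete, here is a small Python sketch that turns the two pairs of word counts above into Obama’s share of the candidates’ combined words. The helper name is made up, and moderator words are left out because the post does not give them.

```python
def share(first_speaker_words, second_speaker_words):
    """Fraction of the two speakers' combined words spoken by the first."""
    return first_speaker_words / (first_speaker_words + second_speaker_words)


# Word counts quoted above for the first and third Obama-Romney debates.
first_debate = share(7005, 7742)
third_debate = share(7493, 8553)

print(f"Obama's share, first debate: {first_debate:.1%}")
print(f"Obama's share, third debate: {third_debate:.1%}")
```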

On particular other words:

  • As you might have guessed, McCain and Biden loved talking about friend/friends/etc. So did Kemp, Palin, Lieberman, Romney, and George W.
  • The big users of leader/leadership/etc have been Dukakis, Ferraro, Mondale, Bentsen, Palin.
  • I was fascinated by how much Bill Clinton changed in his DNC speech this year–the Atlantic has a great visualization–I expected him to be a big user of “now,”. But he’s not the main one: Perot, Obama, Edwards, Kerry, Bentsen, and Gore are.
  • Everybody loves freedom, but especially Kennedy, Nixon, Palin, Cheney, George W., and Ryan. Liberty gets much less love (its proponents are George W., Lieberman, Ryan, Mondale, Clinton, and Dole).
  • In the debates, Romney has painted a dire picture, yet insisted he is an optimist. Optimist/optimistic/etc are used most by Bush Sr., Bush Jr. (a family value?!), Kemp, Dukakis, Reagan, and Romney.
  • I wanted to give you the great uh‘ers, but it looks like the transcribers weren’t consistent in transcribing it. But if you’re curious, Ford has 358 uh‘s and Carter has 507 (Carter does have more words overall, so just comparing them, you’d say that Ford was more uh-prone).

So there you go. Properly, I should be reading all of the examples and giving some interpretations. I’m going to leave that to you all. What do you see in digging through the data (or your memory banks)?

Romnesia–word play techniques

20 Oct

(Add your suggestions in comments!)

So one of the words hitting big around social media is “Romnesia”, referring to presidential candidate Mitt Romney’s forgetting his own positions because they’ve changed so many times.

Part of the delight in this word is that it’s a fun mishmashing of “Romney” and “amnesia”. If you enter “Romney” into a rhyming dictionary (e.g., RhymeZone), all you’re going to get is “omni” and that’s not very satisfying. How could we identify other candidate words? That’s what this post is about.

Here, the CMU Pronunciation Dictionary is going to be very helpful. The current version has 133,315 entries (including “Romney” and many other proper names, btw; and a lot of obscure words, too). Each word is given as it is normally spelled and how it’s pronounced (here’s the guide to the various characters).

“Romney” is transcribed as “R AA1 M N IY0”. The 0 means that there’s no stress on the second syllable; the 1 means that it’s the “aa” vowel in the first syllable that’s stressed. If there were a 2 in this word, it’d mean that vowel had secondary stress.
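Here is a minimal Python sketch of reading that notation, assuming only what is described above: symbols are separated by spaces and vowels carry a trailing 0, 1, or 2. The function and label names are invented for this illustration.

```python
STRESS_NAMES = {0: "unstressed", 1: "primary stress", 2: "secondary stress"}


def parse_phones(pron):
    """Split 'R AA1 M N IY0' into (phone, stress) pairs.

    Consonants carry no digit, so their stress is None.
    """
    pairs = []
    for symbol in pron.split():
        if symbol[-1].isdigit():
            pairs.append((symbol[:-1], int(symbol[-1])))
        else:
            pairs.append((symbol, None))
    return pairs


romney = parse_phones("R AA1 M N IY0")

# Keep only the vowels and spell out what their stress digit means.
vowels = [(phone, STRESS_NAMES[s]) for phone, s in romney if s is not None]
print(vowels)
```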

So the first place to look for candidates is searching for “R AA1 M” words. That turns up 147 possibilities. My personal favorites:

  • Abercrombie–>Abercromney
  • CDROM–>CDRomney
  • Drama–>Dramney (or dromney…works better in speech than writing)
  • Dromedary–>Dromneydary (I want this to work for the visual)
  • Pogrom–>Pogromney
  • Prominently–>Promneynently (uh, okay, no)
  • Ramadan–>Romneydan/Ramneydan/Ramadomney

147 isn’t really that long of a list, but you can imagine putting in restrictions based on which syllable the match falls in and whether there’s some initial consonant cluster (for example, none of the tr’s and str’s work; sometimes non-initial stress works, but not when there’s too much stuff following: Andromeda doesn’t really turn into Andromneyda).
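The search for “R AA1 M” words can be sketched in Python roughly as follows. This is not necessarily how the counts in this post were produced. The tiny SAMPLE dictionary is hand-typed in CMU style for illustration (only the “Romney” entry is quoted from above; the others are assumptions and may not match the real file), and `has_run` is a made-up helper name.

```python
def has_run(phones, pattern):
    """True if `pattern` occurs as a contiguous run of symbols in `phones`."""
    n = len(pattern)
    return any(phones[i:i + n] == pattern for i in range(len(phones) - n + 1))


# Hand-typed, illustrative entries; NOT copied from the CMU dictionary file.
SAMPLE = {
    "romney": "R AA1 M N IY0",
    "drama": "D R AA1 M AH0",
    "pogrom": "P AH0 G R AA1 M",
    "amnesia": "AE0 M N IY1 ZH AH0",
}

pattern = "R AA1 M".split()
matches = sorted(
    word for word, pron in SAMPLE.items() if has_run(pron.split(), pattern)
)
print(matches)
```

Matching on whole symbols rather than on the raw string keeps a pattern from matching part of a longer symbol. If NLTK and its cmudict data are installed, `nltk.corpus.cmudict.entries()` yields (word, list-of-phones) pairs that could be used in place of SAMPLE.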

We can also look at other consonants that are like “r” or “m”. R and L are both “liquids” while M and N are both nasals. So they are good possibilities. If we search for “r aa1 n”, there are 254 possibilities. But these don’t seem to work (Chronicle–>Chromnicle??). There are 62 possibilities for “l aa1 m”.

  • Agglomerate–>Aggromneyrate? (Agglomeromney seems like a better variant)
  • Conglomerate–>Conglomneyrate/Congromneyrate/Congromerateney? (Conglomeromney)
  • Llama–>LLamney (Okay, this doesn’t work at all, but I really want it to)
  • Salami–>Salamney
  • Salaam–>Salaamney

I’m going to skip the 150 “l aa1 n”‘s.
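The liquid and nasal swaps can be generated mechanically. Below is a small Python sketch, assuming just the two sound classes mentioned above (R/L and M/N); the names are invented for this illustration, and the counts are the ones reported in this post.

```python
from itertools import product

# Sounds that can stand in for each other: liquids (R, L) and nasals (M, N).
SOUND_CLASSES = {"R": ["R", "L"], "M": ["M", "N"]}


def pattern_variants(pattern):
    """Expand each symbol into its class; other symbols stay as they are."""
    slots = [SOUND_CLASSES.get(symbol, [symbol]) for symbol in pattern.split()]
    return [" ".join(combo) for combo in product(*slots)]


variants = pattern_variants("R AA1 M")

# Match counts as reported in this post (not recomputed here).
reported_counts = {"R AA1 M": 147, "R AA1 N": 254, "L AA1 M": 62, "L AA1 N": 150}

for variant in variants:
    print(variant, reported_counts[variant])
```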

Now, what “Romnesia” actually does is:

  • Take a word that has stress on the “IY” syllable (amNEsia), even though that isn’t where the stress is on ROMney
  • It also has the nice m-n sequence.
  • It also starts with a vowel, which makes it particularly hospitable for a swap out.
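Those three properties can be checked mechanically for any pronunciation. The Python sketch below is one way to do it; the function name is made up, and the “amnesia” and “needy” pronunciations are hand-typed in CMU style as assumptions rather than copied from the dictionary.

```python
# The vowel symbols used in CMU-style transcriptions, without stress digits.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}


def romnesia_features(pron):
    """Check a pronunciation for the three properties listed above."""
    phones = pron.split()
    pairs = list(zip(phones, phones[1:]))
    triples = list(zip(phones, phones[1:], phones[2:]))
    return {
        # Stressed "IY" right after an "N", as in amNEsia.
        "n_iy1": ("N", "IY1") in pairs,
        # The full m-n sequence before that stressed vowel.
        "m_n_iy1": ("M", "N", "IY1") in triples,
        # Vowel-initial words are easier to swap a prefix into.
        "vowel_initial": phones[0].rstrip("012") in VOWELS,
    }


amnesia = romnesia_features("AE0 M N IY1 ZH AH0")  # hand-typed, assumed
needy = romnesia_features("N IY1 D IY0")           # hand-typed, assumed
romney = romnesia_features("R AA1 M N IY0")        # quoted in this post

print(amnesia)
print(needy)
print(romney)
```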

So let’s look at other words that have the “m n iy1” form–actually, it’s just amnesia and amnesiac. Okay, let’s loosen up to “n iy1”: there are 541 of those.

  • Aeneid–>Romneid (his heroic journey, for the literary)
  • Anemic–>Romnemic
  • Cantonese–>Cantoromnese (maaaaybe)
  • Designee–>Desigromney
  • El Niño–>El Romniño
  • Polynesia–>Polyromnesia
  • Needy–>Romneedy
  • Needlepoint–>Romneedlepoint
  • Neophytes–>Romneophytes

Update: My friend, Rick, came up with:

  • Insomnia–>InRomnia

Earlier, I had been looking for “m n iy1”, that is, stuff like “amnesia”. But this of course is another possibility: go for “aa1 m n iy”, where the “mn” pair is preceded by the main stress in “Romney” and then followed by an unstressed “iy”. The only words in the CMU pronunciation dictionary with this form are insomnia, insomniac, omni, Romney, and Romney’s. But this also suggests that the “iy” vowel may not be so important:

  • Romnimpotent
  • Romnipresent