Paul Ryan dislikes Trump almost as much as Cruz does: On (not) naming names at the conventions

3 Aug

Donald Trump says he isn’t “there yet” in supporting Republican Speaker of the House Paul Ryan–turnabout for Ryan, who has repeatedly chastised Trump and famously said he wasn’t there yet on Trump earlier this year. How strong is Ryan’s distaste? Did he (and his writers) overcome it at the RNC convention?

Let’s examine this question by looking at how people refer to other people (I’ll be using the term ‘referents’ and building on my previous post about the most DNC-y and RNC-y words).


What children and spouses do and don’t do

If Paul Ryan’s distaste for Trump shows up in his choice of referents, then we’d expect his choices to look pretty different from those of someone who loves or likes the nominee. Families have all sorts of interesting dynamics, but let’s begin by assuming that spouses and children speaking at a major convention will perform love. So let’s start with them. I’ve described all the referents in sections below, but they can be summarized this way:

  • Children don’t refer to their parents by their first name or their last name only
  • Spouses don’t refer to their partners by last name only (but obviously first name only is fine)
  • Full names are available for children (and spouses) to use for rhetorical effect
  • Basically all your pronouns refer to the nominee who is your relative
  • You don’t refer to the other candidate at all

What does the average DNC/RNC person do with first/last names?

Let’s look at the RNC/DNC speeches that aren’t given by family members (or the nominees).

Across 51 RNC speeches, there are 160 Donald Trump‘s, 4 Donald J. Trump‘s. There are 4 Mr. Trump‘s, 1 Fred Trump, 2 President Trump‘s; there are 22 other Trump uses and 30 other Donald’s. Across 42 non-Clinton DNC speeches, there are 145 Hillary Clinton‘s, 10 other Clinton‘s, and 130 other Hillary‘s. In other words…there’s a lot more naming of Hillary Clinton at the DNC.

There are six RNC speakers who don’t use Donald or Trump at all: Cotton, Ernst, Kirk, Mukasey, Perry, and Sullivan. Across 42 DNC speakers, only one of them doesn’t use Hillary or Clinton: Kareem Abdul-Jabbar. In a moment, we’ll tackle whether counts of names are a good measure. Before I do that, let me propose something slightly more complicated than simple counts.

How could we measure naming preferences?

Different speakers have different areas of focus and they have different word counts, so at a minimum we should take those into consideration. We’re on pretty firm footing, it would seem, to say that the convention at a convention is to mention the nominee. So the longer a speech goes without saying their name, the odder it is.

I’m interested in whether tf-idf can be used to quantify this oddness. Normally, tf-idf (“term frequency–inverse document frequency”) is used for things like search retrieval–if you’re searching for a term across a bunch of documents, tf-idf is one way to figure out that you should show Document X but not Document Y. It’s also a way of quantifying “aboutness”.
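
If you want to play along at home, here’s a minimal sketch of the kind of computation I mean, using scikit-learn. The file names are invented, and scikit-learn’s default smoothing and normalization won’t reproduce my exact numbers, but it shows the shape of the measure:

```python
# Sketch only: the speech files are placeholders, and the library's
# default smoothing/normalization differs from what produced the
# scores quoted in this post.
from sklearn.feature_extraction.text import TfidfVectorizer

speeches = {
    "ingraham": open("rnc/ingraham.txt").read(),
    "cruz": open("rnc/cruz.txt").read(),
    # ...one entry per non-nominee, non-family speech
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(speeches.values())
col = vectorizer.vocabulary_["donald"]

for i, speaker in enumerate(speeches):
    # A high score means the speech uses the name a lot relative to
    # how widespread the name is across all the speeches.
    print(speaker, matrix[i, col])
```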

Across the RNC speakers who aren’t Trumps, the average tf-idf of Donald is 0.055 (median 0.045). The average for Trump is 0.052 (median 0.051). The average tf-idf at the DNC for Hillary is 0.077 (median 0.071), the average tf-idf there for Clinton is 0.042 (median 0.037).

Sniff test #1: Do the biggest scores pick out big supporters?

Let’s look at who has a high tf-idf score for the nominee names:

  • Hillary
    • Alison Lundergan Grimes (0.20 tf-idf; 8 Hillary‘s, 2 Hillary Clinton‘s, 1 President Clinton (Bill, not counted here))
    • Gabby Giffords (0.16 tf-idf; 3 Hillary‘s)
    • Ryan Moore (0.15 tf-idf; 6 Hillary‘s, 2 Hillary Clinton‘s)
  • Donald
    • Dana White (0.20 tf-idf; 5 Donald Trump‘s, 4 Donald‘s)
    • Rick Scott (0.14 tf-idf; 5 Donald Trump‘s, 1 Donald)
    • Laura Ingraham (0.13 tf-idf; 6 Donald Trump‘s)
  • Clinton
    • Joe Sweeney (0.11 tf-idf; 2 Hillary Clinton‘s, 2 Secretary Clinton‘s, 1 Hillary)
    • Tom Harkin (0.11 tf-idf; 6 Hillary Clinton‘s)
    • O’Malley (0.11 tf-idf; 7 Hillary Clinton‘s)
  • Trump
    • Kerry Woolard (0.14 tf-idf; 5 Donald Trump‘s, 3 the Trumps (about the family, not counted here), 2 Trump Winery (counted))
    • Laura Ingraham (0.13 tf-idf for this term, too, see above)
    • Harold Hamm (0.13 tf-idf; 4 Donald Trump‘s, 2 President Trump‘s)

If tf-idf of the nominees’ names is important, then these people should be among the biggest proponents. Is that true?

Well, Alison Lundergan Grimes is described as a “close family friend“. Gabby Giffords had a very short speech since it’s hard for her to speak but she has endorsed Clinton since January 10th. Ryan Moore met Hillary Clinton when he was seven years old.

The people who have high tf-idf scores for Clinton are less clear to me. Retired Iowa Senator Tom Harkin endorsed Hillary Clinton in August of 2015, which is an early endorsement. Martin O’Malley ran against Clinton and hasn’t always had nice things to say, but in his speech he praises her passionately from start to end. Finally, I’m not quite sure what to do with NYPD detective Joe Sweeney–is he an ardent, long-term Clinton supporter? That’s a real question. But in the meantime, every mention of the Democratic nominee in his speech is glowing.

Trump has been good to UFC president Dana White over the years. Rick Scott only endorsed Trump after he had already won the Florida primary, though in the speech he goes out of his way to say Trump is a friend.

Laura Ingraham has supported Trump in various ways for a while and her speech seems pretty full-bodied. Kerry Woolard is the General Manager of Trump Winery, so whatever she feels, she has a big incentive to perform support. Harold Hamm is a potential cabinet pick for Trump and very pro-Trump.

Not surprisingly, mentions of the first name alone suggest familiarity. There are various ways we might discount or weight the tf-idf scores for first name vs. last name vs. full name. But for now, the top tf-idf ones do seem to be performing strong support.

Sniff test #2: What about no-naming?

Kareem Abdul-Jabbar never mentions the Democratic nominee in his prepared remarks. Instead, he focuses on Donald Trump and Mike Pence as being deeply problematic because of their penchants for discrimination. I’m not sure what Abdul-Jabbar’s beliefs are, of course, but it does seem plain that in his speech he is more anti-Trump than particularly pro-Clinton. The words with the highest tf-idf scores for him are Jefferson, Khan, tyranny, and discrimination.

The RNC has a lot more no-namers. And they have longer (more problematic) histories with Trump.

  • Joni Ernst’s highest tf-idf words include Iowa, my, our, country, and failed
  • Tom Cotton’s highest tf-idf words include wishes, infantryman, Army, punishing, volunteered, and peace
  • Charlie Kirk’s highest tf-idf words include campuses, democrats, party, and youngest
  • Michael Mukasey’s highest tf-idf words include falsely, she, emails, law, and hacked
  • Rick Perry’s highest tf-idf words include battles, Texas, and veterans
  • Dan Sullivan’s highest tf-idf words include Senate, we, and United States

Do these people dislike Trump? I like this headline: “Rick Perry Gives Speech At Trump-Centered Convention, Pretends Donald Trump Doesn’t Exist”. Ernst declined to be considered for Vice President and has tended either to avoid talking about Trump or not to say great things about him. Tom Cotton has spoken out against Trump’s Muslim ban. Sullivan, like most of the Alaska delegation, was more about supporting the nominee than about Trump specifically.

Finally, in May of this year, Charlie Kirk posted an article called “I saw Trump coming, and I chose to ignore it”. I can’t tell you what’s in that article, though, because he’s removed it. But that’s the syntax of a critic, not a supporter.

In other words, failure to mention the nominee’s name minimally means the speaker is concentrating on something other than the nominee. I think it’s also reasonable to say that people who don’t really support the nominee will avoid mentioning them.

What does Paul Ryan do?

Paul Ryan refers to Donald Trump as Donald Trump two times at the RNC convention:

  • In the opening of his speech: “But you’ll find me right there on the rostrum with Vice President Mike Pence and President Donald Trump.”

  • And in the middle: “Only with Donald Trump and Mike Pence do we have a chance at a better way.”

(Ponder-point: how important is it that Donald Trump’s mentions link him to Mike Pence, the “stable” one?)

This is basically even with the amount Ryan mentions the rival nominee. He says Hillary Clinton once, Hillary twice and Clinton twice (“another Clinton”, “The Clinton years”).

Ryan has zero pronouns referring to either Trump or Clinton.

Is this surprising? Ryan’s speech has 1,609 words, making it the tenth longest at the RNC this year (in order: Trump, Pence, Barrack, Ivanka, Gingrich, Geist and Tiegen, Cruz, Christie, Eric, then Ryan).

Only a few people have tf-idf scores for Donald that are lower than Ryan’s: the six people who never use his name at all as well as the duo of Mark Geist/John Tiegen and…Ted Cruz.

There are seven people who don’t use Trump at all (the six above plus Jason Beardsley). Again, only Geist/Tiegen and Ted Cruz have lower tf-idf scores. Geist and Tiegen just say Donald Trump once and Ted Cruz also just says it once:

  • Geist & Tiegen: Someone who will have our backs. Someone who will bring our guys home. Someone who will lead with strength and integrity. That someone is Donald Trump.

  • Ted Cruz: I want to congratulate Donald Trump on winning the nomination last night.

The Geist & Tiegen mention doesn’t really look like a snub so much as they are talking about other stuff (their highest tf-idf words include Glenarm, and Tripoli…they also have 4 rhetorical uses of someone that are really referring to Trump).

The Ted Cruz mention is…well, if you’re reading this you probably already have read about his non-endorsement and know he’s a very Big Anti-fan of Trump. And Paul Ryan’s statistics are very close to his.


You’re going to tell me about Bernie, right?

Sure, okay. Bernie’s speech at the DNC has 14 uses of Hillary Clinton and one Secretary Clinton. That means that Bernie’s tf-idf scores for Hillary are in the top quartile in the DNC and the tf-idf scores for Clinton are just under the median for the DNC speakers.

He’s largely seen as giving a complete and total endorsement of Clinton in his speech, so these numbers check out.

Hey, I want the gory details about the family members

Chelsea, Ivanka, and Eric

If you’re Chelsea, you don’t call your mom “Hillary” or “Clinton” or “Hillary Clinton”. It’d be a bit weird, right? All of her uses of pronouns refer to her mom.

  • The pronoun she is fine (31 uses)
  • So is her (16 uses)
  • my mom
  • my own mother, 1 My wonderful, thoughtful, hilarious mother, 1 a mother (about Hillary) and 5 my mother

Chelsea’s two uses of he are about her son Aidan–there’s no mention of Trump in her introduction of her mom.

Ivanka does something different. Like Chelsea, she never refers to her parent by just his first name or just his last name. But for rhetorical effect, she does refer to Donald Trump seven times and Donald J. Trump once. (If you want to track everything: Ivanka refers to Trump Tower twice and Trump Organization once.) Like Chelsea, all of Ivanka’s pronoun uses are about her parent-the-nominee:

  • Ivanka has 37 uses of he
  • She has 27 uses of his
  • And she has 12 uses of him
  • 18 my father‘s

Ivanka also has 1 her (re: America) and 1 she (re: a woman becoming a mother).

Eric uses Trump twice–once in reference to The Eric Trump Foundation and once to say he has never “been more proud to be a Trump”. All but one of Eric’s pronouns refer to his dad:

  • Eric has 25 uses of he 
  • 18 his
  • him
  • my dad and 1 Dad
  • 19 my father’s

The only instance that doesn’t refer to his dad is also the only time he uses a female pronoun: “the veteran tuning into this speech from his or her hospital bed”.

Bill and Melania

Spouses have different rights and requirements. Bill Clinton does refer to his wife as Hillary (25 times).

  • She (134 times in that form, 9 she’s, 3 she’ll and 1 she’d)
  • Her (67 times and 2 hers)
  • (No uses of wife)

He also has one use of their shared last name: “after Hillary testified before the education committee and the chairman, a plainspoken farmer, said looks to me like we elected the wrong Clinton”.

Bill Clinton never refers to Trump by name or by pronoun. Bill Clinton’s he‘s are about Gingrich (1), Obama (2), Don Jones (3), an imaginary son for checking on a racist school (1), Franklin Garcia (1), a Law School classmate (1). His him‘s are about a segregationist (1), DeLay (1), Rangel (1). The three his in Bill Clinton’s speech are about Obama (2) and Don Jones (1).

Like Ivanka, Melania refers to Donald Trump by his full name several times. She uses Donald J. Trump twice and Donald Trump once. She refers to the Trump family and a Trump contest (referring to the campaign through November). She refers to Donald 14 other times. All her pronouns are about him (Melania doesn’t use any female pronouns in her speech):

  • 19 he 
  • 12 his 
  • him
  • my husband‘s

I guess I should wrap up with a note here to reaffirm that speakers have lots of help writing their speeches. That makes it a bit more likely that they will fit certain conventions (e.g., to make a good First Lady speech, copy from a former First Lady speech). This post has been about getting a handle on those hidden rules and seeing how far people deviate from them. We don’t have access to people’s hearts–but we do have access to what they write and say and can compare it to others.

Failed vs. fighting: the linguistic differences between speeches at the RNC and the DNC conventions

1 Aug

We know that Republicans and Democrats talk differently, but what’s the best way to describe these differences? Commentators note the relative darkness of the Republican National Convention and the focus on optimism and higher production quality for the Democratic National Convention. Looking at the words speakers use helps–but you can’t just use simple frequency (for details, check out the methodology section at the bottom).

The major differences

I’ve listed the top 15 words most characteristic of each convention, but I believe they boil down to the difference between “failed” and “fighting”. Consider some of the words that are firmly on the Republican side: government, failed, and politicians.

Failure is, of course, negative. You could consider government and politicians to just be neutral nouns, but that’s not how they are used by Republicans. And it seems that Democratic speakers are sensitive to this since they tend to avoid these words–there were 61 uses of government at the RNC but only 10 at the DNC. There were 19 occurrences of politicians at the RNC but only one at the DNC (that’s Obama and he’s using it negatively, too). But there are two ways of looking at this: the Republicans have seized them for negative purposes…and the Democrats have ceded them to anti-government/anti-politics forces on the right.

Meanwhile, fighting emerges as an important Democratic word. It’s not that the count is huge here–40 DNC uses versus the Republicans’ 6–but that it is indicative of a dominant framing. The past tense of this word is also characteristic of the DNC–Hillary Clinton and others are described by the battles they have fought (34 uses at the DNC, 5 at the RNC). This is a contrastive analysis, so it’s worth pointing out that Hillary Clinton has a longer career in politics than Donald Trump and that his past work isn’t usually described in terms of battles, with perhaps one major exception.

For the Democrats, joint action also seems to be highly relevant–voting, for instance, or together, which is used regularly even outside the campaign motto, stronger together (stronger is also more of a DNC word than an RNC word–it scores 0.64 in Democratic relevance). Together was used 95 times at the DNC, only 27 times at the RNC. That said, the phrase stronger together only occurred 13 times at the DNC, once at the RNC. The great again part of Trump’s campaign motto occurred 24 times at the RNC, 6 times at the DNC.

Other differences include:

  • Republicans often invoke Benghazi and terrorism (Benghazi, enemies, Islamic, terrorism, radical)
  • Republicans frame immigration/terror in terms of borders
  • Children of nominees don’t usually have as prominent a role as they did at the RNC: father occurs 56 times in 2016 RNC speeches. That said, it’s the Democrats who focus on kids (61 uses at the DNC, 14 at the RNC–child and children also skew strongly Democratic)
  • Among the issues that Democrats focus on are health care/insurance, issues of social justice and gun control, which barely misses being in the infographic
  • The Democrats tend to be more colloquial–the conventions are about equal in their use of we, but the Democrats are much more likely to use the contractions we’ve and we’re as well as she’s and that’s. The Democrats also use got (usually got to) while the Republicans use have (though the strongest phrase for them is have been).
  • Ben Zimmer points out that researchers have found that the word the correlates with “high psychological distance”. It may also correlate with older male speakers. Check out the Language Log for some other thoughts.


Going beyond words

I could also give you the two-word phrases that pop out, but let me just summarize those.

For the Republicans:

  • Trump will
  • my father
  • our enemies
  • Donald Trump
  • he will
  • American Dream
  • no longer
  • who will
  • great again

For Democrats:

  • she knows
  • fighting for
  • each other
  • she was
  • that’s why
  • First Lady
  • health care
  • when she
  • with her

Methodology and data

This post uses techniques described in Monroe et al. (2008), a paper that pursued this as a question of methodology. In their paper, one of the prime examples was how Democrats and Republicans in Congress talk about issues like abortion.

The main point of Monroe et al is that you need to figure out some way to contrast two categories against each other and some background information–for example, Republican speeches on abortion versus Democratic speeches on abortion, with a background of everything else that Republicans and Democrats talk about. Technically speaking, I’m using weighted log-odds-ratios with informative Dirichlet priors. I call these “relevancy scores” in the infographic.
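
If the phrase “weighted log-odds-ratios with informative Dirichlet priors” makes you want to see code, here’s a small numpy sketch, assuming you’ve already turned each convention into a word-count vector over a shared vocabulary (the function name and setup are mine, not Monroe et al.’s):

```python
import numpy as np

def weighted_log_odds(counts_a, counts_b, prior_counts):
    """Weighted log-odds-ratio with informative Dirichlet priors
    (Monroe, Colaresi & Quinn 2008). Inputs are aligned count vectors
    over the same vocabulary; the prior comes from background text."""
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    prior = np.asarray(prior_counts, dtype=float)
    n_a, n_b, n_p = a.sum(), b.sum(), prior.sum()

    # Log-odds of each word in each corpus, smoothed toward the prior.
    log_odds_a = np.log((a + prior) / (n_a + n_p - a - prior))
    log_odds_b = np.log((b + prior) / (n_b + n_p - b - prior))
    delta = log_odds_a - log_odds_b

    # Divide by the (approximate) standard deviation so rare words
    # don't dominate--that's the whole trick.
    variance = 1.0 / (a + prior) + 1.0 / (b + prior)
    return delta / np.sqrt(variance)
```

Positive scores go with one convention, negative scores with the other; scaling the prior counts up or down controls how strongly the background data reins in the estimates.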

For the data, I took 50 speeches from this year’s RNC (50,191 words), 45 speeches from this year’s DNC (53,994 words), and 47 speeches from the past (180,719 words). DNC speakers use more words overall and their sentences are longer, too.

For the 2016 convention speeches, I used all of the speeches that the two parties made available via Medium, but for the prime-time speeches, I made an effort to get actual transcripts (removing annotations about applause, laughter and chants from the audience).

For setting “priors”, I used these word counts plus data from the recent and not-so-recent past. The past data is made up of: (a) all nominees’ speeches back to Carter and Reagan in 1980, (b) all the spouses’ convention speeches back to 2000–except for Tipper Gore’s, let me know if you can find a transcript for that, and (c) all the public speeches recorded for every Republican/Democratic presidential candidate between Jan 2016 and the conventions, as provided by The American Presidency Project.

“Donald” is not an important word in prior conventions, so it is characteristic of the current year and the RNC, in particular (218 uses of it in the RNC, 142 in the DNC speeches). The same goes for “Hillary” (300 times in the DNC speeches, 181 in the RNC). And that’s also one of the reasons that “she” appears as a major keyword for the DNC (419 occurrences in the DNC, 142 in the RNC). Clinton is the first nominee of a major party whose pronoun of choice is “she”. But there’s not much special about “he” (234 in the DNC, 288 in the RNC). Yes, “he” often refers to Donald Trump in the conventions this year, but it could also refer to all the other nominees that both parties have fielded.

Which new emoji will be the most popular?

20 Jun

June 21st is the release of Unicode 9, which will feature 72 new emoji–folks at Emojipedia have helpfully put them all together. The question in this blog post is: which ones will turn out to be the most popular? (Note that most people aren’t going to be able to use them immediately–you have to get an update of your phone/browser for them to show up and so will anyone you want to send them to.)


Two emoji that won’t become popular are the rifle and the modern pentathlon, since they won’t be easy to access. In May, Apple led an effort against them, so you almost certainly won’t see them on any keyboard even though I believe the code will be in place.

Using past data to predict the next round

In general, you should bet on hearts, faces, and hand gestures. Here are some screenshots from emojitracker.com and EmojiXpress, which show what’s been most popular on Twitter and SMS text messages, respectively.


Top emoji of all time on Twitter by emojitracker.com (the coloring doesn’t mean anything you need to worry about)


Top emoji of all time from emojiXpress, which represents use in SMS text messages

EmojiXpress also helpfully shows which of the newest emoji have been most popular:


The most popular of the newest emoji used on emojiXpress

Note that there was a pretty big campaign for the taco, but emojiXpress currently has it 21st among the emoji that were released last year. So I don’t think that augurs well for those of you who want to predict bacon. I took a quick look at the usage of taco over the last several days and there’s no upward trajectory; it’s plodding along at the rate it has held for the last several months. If anything among the Unicode 8 emoji is trending, it’s probably the scorpion, but it’s going to take a while to overtake even the chipmunk.

Prediction: Overall

My prediction for the number one overall spot is the ROFL face, because there’s a strong tendency for people to use emoji to express happy states of affairs. My I-hope-it’s-not-the-runner-up pick is the black heart.


Prediction: Not-faces

I think it’s likely that the shrug and the face palm are going to have aficionados. And while I like the John Travolta moves on the dancing man, I’d rather live in a world in which we all agree that EVERYONE is a woman-in-a-red-dress 💃 (note that not all platforms show a red dress).

Meanwhile, there’s a cartwheel, which really should be popular, but it’s going to appear in the athletic section so most people will miss it. And they’ll miss water polo, too, which is a shame not just because I used to play but have you seen water polo players?

Even before skin tones were easily available, people using the #blacklivesmatter and #icantbreathe hashtags on Twitter were using a lot of the fist emoji to indicate solidarity and Black Power. Now that skin tones are available people can use hand gestures and other people-emoji that more accurately describe them.

The new batch of hand gestures can be used playfully, positively and politely (handshakes and fist bumps being ways of making contact). I’m not quite sure how having left- and right-facing fist bumps will work. Most emoji are just one way, like you have to run and drive off to the right 🏃🚗🚓. It’ll be neat if people offer a fist bump facing right and then get a reply that has a fist bump facing left to connect.

Back to popularity. I want to vote for the shrug, but I’m afraid that the fact that it looks like it’s going to show a woman’s head by default means that a lot of people won’t use it. But like the dancing woman, we should all use it. The smart money is likely on the raised hand since it can mean so much (stop, high five, etc.). But I’m going to wager that folks taking selfies, and making fun of them, are going to cause the selfie emoji to take off:


Prediction: Animals

Finally, as much as I like the gorilla, it’s probably not going to win the animal bracket. Animals are an interesting class because they are all nouns. They offer further evidence that emoji aren’t really about substituting for nouns. Instead emoji are usually about emotional stance, identity, and metaphor. The most popular animals include the see-no-evil monkey, which people don’t use to talk about the actual animals 🙈. The cat-faces with heart eyes or tears are also popular 😻. The unicorn is also very popular among the new emoji–and it doesn’t even really exist…but it can convey sparkle magic.

Despite my warning about treating emoji as if they were just noun-pictures, let’s look at how they are used as nouns. For example, the fox gets talked and written about a lot–at least in American English over the last five years. But that’s mainly because of Fox News. My plea: do not use the fox emoji to refer to this organization.

| Search term | Per million words |
| --- | --- |
| a fox | 2.84 |
| a deer | 2.82 |
| a shark | 1.81 |
| a bat (not disambiguated) | 1.69 |
| a duck | 1.64 |
| a butterfly | 1.37 |
| an eagle | 1.24 |
| an owl | 0.85 |
| a gorilla | 0.28 |
| a shrimp | 0.28 |
| a lizard/gecko/newt | 0.25 |
| a rhino/rhinoceros | 0.20 |
| a squid | 0.17 |

Lots of people who aren’t Americans or English-speakers use emoji, of course. The top new animal emoji for Spanish-speakers may be the butterfly, the duck and the fox. For Portuguese speakers, it may be the lizard, the butterfly, the shark, and the eagle. But properly, I should use bigger corpora on those languages and add at least a half dozen more. Even better would be to look at image search results to see which animals people are searching for. 

If this post could put me in touch with Joan Embery, that would be swell. But until she weighs in, I’ll make the claim that butterflies are also the most likely of these animals for people to encounter worldwide, since they seem to be found everywhere except Antarctica. But I am skeptical of using in-life frequency to predict emoji frequency. So it is because butterflies are widespread symbols of natural beauty that I’m going to pick them as my Most Likely to Succeed in the animal bracket.


What do you predict?

I think the other two interesting brackets are probably sports and food. Which ones are you picking?

Artificial intelligence in the press and in history!

16 Jun

Over on CrowdFlower’s blog, I’ve got two posts on artificial intelligence.

  • How does the media cover AI?
    • In which I look at 2,000 articles over the last year and a half and talk about the major themes.
  • An AI Springtime
    • Where I take a look at whether we’re just in hyper-hype and how past hype/bust cycles have worked. Also I get to count volcano eruptions and tokens of “disappoint”.

The major themes in recent press about artificial intelligence

Poetry v. not-poetry

5 Jun

I’ve been training an artificial intelligence system to write poetry and this morning I got interested in what the little parts of syntax and semantics are that preoccupy poets compared to other forms of written language. So I took a heap of poetry and a heap of not-poetry, pulled out the bigrams (two-word phrases) and did some statistics to see what distinguishes poetic writing from non-poetic writing.
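
The counting step is simple enough to sketch. The file names here are invented and my real tokenization surely differs in details; each Counter then feeds into the statistics described in the methods section below:

```python
# Rough sketch of the bigram counting; file names are placeholders.
from collections import Counter
import re

def bigram_counts(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            words = re.findall(r"[a-z']+", line.lower())
            counts.update(zip(words, words[1:]))
    return counts

poetry = bigram_counts("poetry_lines.txt")
not_poetry = bigram_counts("not_poetry_paragraphs.txt")
print(poetry[("like", "a")], not_poetry[("like", "a")])
```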

Poets are preoccupied by these phrases:

  • Metaphor:
    • like a
    • like the
  • Nature:
    • the sky
    • the sun
    • the wind
    • the moon
    • the earth
    • the dark
    • the river
    • the sea
    • the snow
    • a stone
    • the water
    • the air
  • Self and others:
    • said geryon (this is because there’s a fair amount of Anne Carson in the data and she has a whole book about a character named Geryon)
    • the world 
    • my mother
    • the dead
    • my heart
    • of your
  • Space/time/prepositional phrases:
    • the present
    • in the
    • on my
    • in your
    • in its
    • under the
    • from the
    • in a

Meanwhile, they steer clear of the following phrases, which seem to be better for writing letters, fiction, essays or other kinds of non-fiction. The point of this comparison set was not so much to compare poetry to any particular other genre, but to collect a variety of “not-poetry” to see what poets tend not to use:

  • she had
  • he had
  • had been
  • she was
  • she said
  • it was
  • did not
  • going to
  • that he
  • that she
  • i had
  • there was
  • to her
  • seemed to
  • to do
  • he was
  • (okay, I’m going to stop there)

In other words, poets–or at least these poets–don’t tend to write much in the past tense. They do, however, orient things spatially, as evidenced by all those prepositional phrases. Not surprisingly, the poetry is a lot more personal, with many more uses of I/my/we/you/each other. And of course like a and like the are prominent, since it’s hard to resist a metaphor.

Notice that there are a lot of definite articles used in poetry. I think that’s mostly in service of talking about nature and poetical things, though there are twice as many the phrases in the poetry camp as in the non-poetry camp. Those that are non-poetic and reasonably frequent in both lists are the most, the hospital, the baby, the american, the time, the fact, and the country.

If you’re curious about some major phrases where there is no difference, let me give you a sample of those: and a, against the, as it, and you do are all examples of phrases that are even in usage between poetic and non-poetic writing.

Methods and data

I took 75,678 lines of poetry (537,711 words) and compared them to 15,000 paragraphs of fiction and non-fiction (534,723 words). If you’re curious about methodology, you can read more about it here and here.

The poetry sample is 37 texts from 35 authors. By word count, the top authors in the data here are Lorine Niedecker (12.8%), Wisława Szymborska (8.7%), Jane Shore (6.1%), and Anne Carson (5.5%). Szymborska wrote in Polish but here I’ve included her in English–so you may want to say I’ve included her and/or the translators, Stanislaw Baranczak and Clare Cavanagh.

The non-poetry is randomly sampled “lines” (paragraphs) from 41 texts by 32 authors. The biggest amounts come from Joan Didion (12%), Virginia Woolf (7.7%), Penelope Fitzgerald (6.7%), Rainbow Rowell (5.3%), and Louisa May Alcott (5.2%).

What about women who aren’t white?

The authors in the data are all white women. You will be shocked–shocked!–to hear that it’s harder to get collections of poetry by non-white poets who are women. I currently only have eight poetry collections that fit. You’ll grant me that it would be strange to include Phillis Wheatley, who was writing in the 1700s, with a bunch of much more modern writers. But if you think I should go ahead and add in people like Claudia Rankine, Nikki Giovanni, and Maya Angelou, I’m certainly open to that critique.

There’s a lot more data available for non-white novelists who are women, so that’s probably the next step. EXCEPT that one wants to be careful about what one is lumping together. So I’m unlikely to compare novelists in terms of race unless I have A LOT more data.

That said, diving into the phrases that preoccupy, say, Toni Morrison or Octavia E. Butler compared to other people (or each other!) has some appeal. If you have particular interests or suggestions, I’d be very happy to get them.

A peek inside text message patterns

23 Apr

Is texting ruining your relationships or saving them?

It’s a bit surprising to me when people think technology is inherently good or bad. Even people are rarely good or bad as far as I can tell, though yes they sometimes work wonders and often they do monstrous things. I bet you could really manipulate people with SMS. Or you could be warm and loving.

Texting is a huge and important phenomenon–every three months, the words sent by SMS equal the number of words ever published in books in the history of humanity (those are the 2013 stats, so it’s almost certainly accelerated). But texting is hard to study, so I thought I’d go through my own data–the names below are pseudonyms.

In particular, I’m interested in how people stay in touch. When you are meeting with someone face-to-face or even over the phone, context is a lot richer. If you’re together, you can see each other’s faces and body language. You can have joint attention on a stranger passing by or a piece of art on a wall in front of you. Even when you’re on the phone, you hear lots of vocal cues. Emoticons, emoji, animated gifs, shared YouTube videos and the like help add color and context but they aren’t quite the same.

But texting can work asynchronously so even if you aren’t in the same city or can’t pick up the phone you can let someone know that you’re thinking of them. This will be a very basic post, mainly laying out some differences between different individuals and different groups.

About the data

From April 1, 2012 to April 23, 2016, I received 77,271 text messages and sent out 57,186. That represents 924 different people (well, mostly people; I didn’t filter out automated messages, but there aren’t very many of these). 226 of these people have sent me more than 50 messages.

I typically send most of my text messages from 11am to 7pm (the median hour for me is 4pm). This is the same as when people texting me are most active.

My own texts tend to be 7 words/39 characters (top quartile of 14 words, 75 characters; bottom quartile of 3 words, 19 characters). Other people are just slightly less verbose: a median of 6 words/33 characters for people texting me (top quartile of 12 words, 65 characters; bottom quartile of 3 words, 15 characters).

Looking at word lengths, people are at their wordiest during the day until about 4pm; they’re at their least wordy from 10pm on. The time between 4pm and 10pm falls between these extremes.
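
If you export your own messages, these summaries are easy to reproduce. Here’s a pandas sketch with an invented two-column schema–timestamps and message bodies–not my actual export format:

```python
# Invented schema: one row per message with a timestamp and a body.
import pandas as pd

msgs = pd.read_csv("texts.csv", parse_dates=["timestamp"])
msgs["words"] = msgs["body"].str.split().str.len()
msgs["hour"] = msgs["timestamp"].dt.hour

# Median and quartile word counts for each hour of the day.
print(msgs.groupby("hour")["words"].quantile([0.25, 0.5, 0.75]).unstack())
```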

Defining contrast sets

Looking over the people I communicate with via text, these groups emerge:

  • The top 10 people I text with are undeniably among the people I’ve been closest with over this time period
  • Among the next 50 people (all have sent over 200 text messages), there are
    • Friends/family who I’m really close to but who don’t text much (they are more phone/in-person people)
    • Romantic interests (a big but fuzzy category since many of these become friends–a bit over half?)
    • Other friends
    • Work folks (we mostly use the Slack chat tool to communicate, not SMS)
  • These categories repeat among the “under 200 text message” senders, though I also start to see people best categorized as “acquaintances”. That’s not an appropriate way to talk about any of the people who have sent me more than 200 messages.

Many of the people I correspond with the most are also in group threads, which means it’s a lot easier to count texts from them to me than from me to them. So for the moment, I’m going to focus on what’s incoming.

In general, these particular numbers don’t do a lot to differentiate people. That is, the prime texting time for everyone is basically the late afternoon, especially if we consider time zones. The more romantic relationships do skew a bit to the later side.

Perhaps the main thing to note is that you’ve got some very wordy people like Amber and Jezebel–who are trained therapists, for what it’s worth–and some less wordy people like Hannibal, who loves emoji, and Rusty, who is known as laconic in real life, too.

Another note about Rusty, Hannibal, and Sigmund: these are among the people I hang out with the most, so many of these texts are coordination messages, which don’t usually need to be very long; they often use phrases like at <location>, on bike, and heading your way.

A few other notes:

  • Apparently romantic relationships require fewer words–they tend to be driving toward real-life contact, so perhaps that is what’s going on.
  • The super-close friends who don’t really seem to like texting have fewer texts (by definition) but they are also a bit shorter. The people in this category tend not to be where I live–most of them live out of state and one travels a lot.
  • “Other friends”–a group of people who send a fair number of texts (200+) but who, qualitatively, I just wouldn’t put quite on the same level as a Hannibal or Rusty or Sigmund–seem to be a bit wordier. This may suggest that more catching up is happening in text.
  • Qualitatively, the “other friend” group does not particularly use other means to stay in touch–these folks do not use, say, Facebook likes, to have another point of contact.
  • Sometimes people talk about breadcrumbing as a way of lightly staying in contact. My sense is that most of my Facebook likes come from people who I text with A LOT or people I rarely/never text with. That middle-group of “we text some” doesn’t use pocket dials or Facebook likes or retweets to say, “Hey, I’m paying light attention to you.”

More soon!

| Name | To Me | Median hour sent | Time range | Median word count | Word count range |
| --- | --- | --- | --- | --- | --- |
| Bobo | 9,713 | 3pm | 11am to 7pm | 7 | 3 to 13 words |
| Hannibal | 5,858 | 5pm | 1pm to 8pm | 5 | 2 to 11 words |
| Rusty | 4,529 | 4pm | noon to 7pm | 5 | 2 to 10 words |
| Amber | 2,800 | 3pm | 11am to 5pm | 12 | 6 to 25 words |
| Jezebel | 2,043 | 4pm | 10am to 5pm (she lives in CST) | 9 | 5 to 19 words |
| Swayze | 2,042 | 6pm | noon to 9pm | 7 | 3 to 13 words |
| Sigmund | 2,030 | 4pm | noon to 7pm | 6 | 2 to 12 words |
| Falcon | 1,411 | 2pm | 10am to 6pm (she lives in CST) | 7 | 3 to 14 words |
| Allistar | 1,356 | 3pm | 11am to 7pm | 6 | 2 to 12 words |
| Urs | 1,157 | 3pm | 11am to 6pm | 7 | 3 to 13 words |
| Low-texting super-close people (200+ texts) | 1,865 | 2pm | 11am to 6pm | 7 | 3 to 13 words |
| Other friends (200+ texts) | 5,479 | 3pm | 11am to 7pm | 8 | 4 to 15 words |
| Other romantic (200+ texts) | 10,152 | 4pm | noon to 8pm | 5 | 2 to 9 words |
| Everyone else (50–200 texts) | 16,248 | 3pm | 11am to 7pm | 6 | 3 to 12 words |

Computing with affective lexicons: Dan Jurafsky at the Bay Area NLP Meetup

2 Feb

First things first: Dating advice

When you collect the right data and get it labeled, you can find really interesting things. For example, record over a thousand 4-minute speed dates and have the participants rate each interaction for how flirtatious they were and how flirtatious their partner was. Key finding: your thinking that someone is flirting doesn’t correlate with them saying they are flirting. It correlates with YOU saying that YOU are flirting.

Getting to meaning

The world is full of concrete things like bananas and bathrobes. But people do a lot more than just refer to objects in the world when they are speaking and writing and speed dating. They express emotions, they draw in their audiences or push them away, they share opinions about things and ideas in the world.

The most recent NLP Meetup was a tutorial about this by Stanford computational linguist (and Idibon advisor), Dan Jurafsky. You should go look at the slides, but it covered three useful areas: 1) how to think about affective stances, 2) which resources are out there ready for use, and 3) how to use computational linguistics methods to develop new resources, with examples from restaurant reviews and speed dating. Dan’s got a book on the language of food you should go check out—our post on it includes a video of him discussing those findings and methods in detail.


Emotion detection

There are lots of signals that convey emotion—speech rate, facial expressions, pause length, darting eyes. Like Dan’s talk, I’ll focus on words.

The first resources Dan presented were lists of words and phrases along the simple dimensions of positive/negative/neutral. There are fewer resources that do that by emotion.

Psychologists often talk about there being six basic emotions—Pixar even made them into the characters for Inside Out. But in practice, psychologists have had a hard time keeping the list that small. In a review of 21 different theories of basic emotions a few years ago, I counted 51 different emotions posited as basic. The average number of basic emotions posited across the 21 studies is 9. Here are the most common:

  • Anger/rage/hostility (18)
  • Fear/fright/terror (17)
  • Joy/happiness/elation/enjoyment (14)
  • Sadness/sorrow/distress/dejection (14)
  • Disgust (12)
  • Shame (9)
  • Love/tender emotion (8)
  • Anxiety/worry (7)
  • Surprise (7)
  • Guilt (6)

There are definitely cases where emotions are more actionable than plain positive/negative/neutral sentiment. But even if there are certain fundamental emotions, they won’t be useful if they are unrepresented in the contexts you care about. For example, a customer care center can make use of emotion detection, but the emotional universe of customer support is really more of disappointment vs. resignation vs. cold anger vs. hot fury vs. relief. Understanding if someone is actually full of joy by the end of a call is useful, but it doesn’t help in routing in the way that understanding level-of-irritation can.

This connects nicely with a theme in Dan’s talk—building statistical models that are based on the categories you care about. (If you want to know about Idibon’s take on this, check out our blog post on adaptive learning.)

Priors: The most important method in the talk

We know that conservatives and progressives talk about issues differently. But a lot of statistical methods for distinguishing Group A from Group B result in uninformative lists, either dominated by frequent words (the and to tell you a little but not much) or by very rare words (how informative is it if we say argle-bargle is a conservative word just because Antonin Scalia used it once?).

Computational linguistics comes down to a lot of counting. A “prior” is something that lets you incorporate background information. Check out Dan’s slides 62 and 63, but don’t let the equations frighten you. Mark Liberman’s post on Obama’s language may also help.

Like Dan and Mark, one of my favorite papers on priors is Monroe, Colaresi and Quinn (2008). It’s useful because it walks through so many (inferior) methods before presenting priors and what they get you. For example, they show that in reproductive rights debates in Congress, Democratic representatives disproportionately use women, woman, right, decision, her, doctor, while Republicans use words like babies, kill, procedure, abort, and mother. These very different framings make sense and are interpretable. Other results aren’t nearly as clean.

For more background on priors, you might want to check out the current draft of the update for Jurafsky & Martin’s Speech and Language Processing, which is one of the most widely used textbooks for NLP. Check out the chapter on classification and the one on building affective lexicons.

A few other tricks

Dan mentioned a few techniques that are simple and surprisingly effective. For example, building a system that really understands negation is very difficult. But you get a long way by just detecting linguistic negation (not, n’t, never) and then doing something that amounts to flipping any positive words that follow, up to a comma or period. This method will get you the right classification for something like It’s not a hilarious or thrilling film, but it is majestic. You detect that not is a negative word and therefore treat hilarious and thrilling as negative, too. Otherwise you’d think this mostly negative sentence was mostly positive.
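
Here’s a minimal sketch of that heuristic. Rather than literally flipping scores, it marks every token in the negated span so a sentiment lexicon can treat them as reversed (the NOT_ convention is standard in the textbook literature, but the details here are mine):

```python
import re

NEGATIONS = {"not", "never", "no"}  # plus n't clitics, handled below

def mark_negation(text):
    """Prefix NOT_ to every token between a negation word and the
    next comma or period--the simple heuristic described above."""
    tokens = re.findall(r"[\w']+|[.,!?]", text.lower())
    out, negated = [], False
    for tok in tokens:
        if tok in NEGATIONS or tok.endswith("n't"):
            negated = True
            out.append(tok)
        elif tok in {",", ".", "!", "?"}:
            negated = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out

mark_negation("It's not a hilarious or thrilling film, but it is majestic.")
# ["it's", 'not', 'NOT_a', 'NOT_hilarious', 'NOT_or', 'NOT_thrilling',
#  'NOT_film', ',', 'but', 'it', 'is', 'majestic', '.']
```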

Another clever technique comes from Peter Turney, who wanted to get the semantic orientation of phrases in reviews of cars, banks, movies, and travel destinations: what phrases point to a thumbs up versus a thumbs down?

Knowing that things like “preposition + article” (on the, at a) don’t do much affective work, he came up with 5 part-of-speech patterns that basically gave him meaningful adjectives or adverbs with a bit of context (see Dan’s slide 41).

Every heuristic has its blindspots and certainly phrases like mic drop, swipe right, and on fleek are opinion-bearing phrases. Because of their parts-of-speech, these would be excluded from Turney’s calculations. But the Turney algorithm does find things like very handy and lesser evil. And as Dan said, “The recovery of virtual monopoly as negative is pretty cool.”
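
Underneath, Turney’s score is a pointwise mutual information (PMI) difference: how much more often does a phrase co-occur with excellent than with poor? A sketch of the arithmetic, with the arguments standing in for the search-engine hit counts Turney originally used:

```python
import math

def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor):
    # Turney (2002): PMI(phrase, "excellent") - PMI(phrase, "poor").
    # In practice, add a small constant to the counts to avoid zeros.
    return math.log2((near_excellent * hits_poor) /
                     (near_poor * hits_excellent))

# Positive scores point thumbs-up, negative scores thumbs-down.
print(semantic_orientation(near_excellent=120, near_poor=10,
                           hits_excellent=1e6, hits_poor=9e5))  # ~3.4
```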

Getting at what matters

How many retweets or Facebook likes will Super Bowl campaigns get? How many minutes was a movie, how much did an appetizer cost, how tall was your date?

These are all, arguably, objective facts that you can measure. But there is another way of thinking about them: they are entirely superficial aspects—and therefore not even really about the ad/movie/restaurant/date at all.

By contrast, how did sexism get combatted in last year’s Always #likeagirl campaign? A feeling of wonder in a movie, the awfulness of service, the engagement from your date: these are subjective matters but they can be assessed and reveal deeper insights than the surface facts that are easiest to count. Natural Language Processing is still about counting, but it opens the possibility to count what counts.

Conspiracy, complaints, and fraud: The language of reasons

10 Nov

Three separate threads have been whirling around my head for the last few months, so I was glad to have the opportunity to connect them a few weeks ago at UC Merced.

Thread #1: Fraud

Fraud is a big deal–the Association of Certified Fraud Examiners places the amount of global fraud loss at $3.7 trillion per year.

If you want to detect fraud, you can’t just look for people writing, “I am committing fraud”. Instead, you look for evidence of the fraud diamond: opportunity, pressure, capability, and the focus of my talk, rationalization.

But one of the things that I’ve been thinking about is: how do people rationalize? That is to say, how do they give reasons to themselves and others to make something okay? I like Karen Horney’s words: “Rationalization may be defined as self-deception by reasoning.”

Thread #2: Customer Complaints

Last week, I wrote a bit about how people use intensifiers when they are filing complaints. Another thing that is prominent in complaint-giving is reasoning. 25% of customer complaints logged with the Consumer Financial Protection Bureau have the word because in them. Here’s an example of the basic structure of because–in English, you can swap the order, but in both speech and writing, people almost always put the result before the cause:

  • Result: We strongly suggest someone look into Citimortgage’s business practices,
  • Cause: because at best they are completely incompetent, and at worst they are committing acts of fraud

In these narratives of what happened, people give reasons for their actions and feelings, but they also attribute reasons to banks and other financial institutions. Reason-giving is bound up in explaining the ways in which customers have been affected and how things should be remedied.

Thread #3: Conspiracy Theorists

Okay, this one is mostly in here because it’s fun.

Towards the end of the summer, two Idibonites started looking at what it is, linguistically, that makes people sound rational versus paranoid. We’re not ready to release our “statistical model of paranoia” yet, but one of the things Jana and Charissa have found has to do with how people give reasons. About 7.8% of /r/conspiracy posts have the word because in them. In the previous section, I noted that consumer complaints about banks had a rate of 25%. So 7.8% is a lot less than that–but if you look across all the Reddit forums, the rate of because in /r/conspiracy puts it in the top quartile of most-because-y. (See below for the ones that get up to 16%.)

Some favorite findings

You can watch the presentation or flip through the slides, but here are probably my favorite points.

  • When a customer complaint about a bank involves a “because”, it’s a much longer complaint. This seems to also be a feature within Reddit.
  • Because is associated with highly emotional content in many domains—ranging from soap opera dialog to speeches in the British Parliament. Reasoning isn’t separate from emotion; it’s built on it.
  • Becauses are much more common in conversations about accounts and mortgages than credit reporting or debt collection.
  • The subreddits where people give the most reasons (highest percentages of because) include those that are specifically about debating (/r/changemyview, /r/DebateaChristian), those that tackle gender and sexism (/r/AgainstGamerGate), and those that have to do with romance (/r/relationship_advice, /r/relationships).
  • Among /r/conspiracy authors, the biggest because users tend to talk about JFK, 9/11, aliens, and space.

You can check out the video recording of the full presentation here:

Here are the slides from the presentation:

Intensity in consumer complaints about banks

30 Oct

Analyzing the language used in consumer complaints tells you about both the topics that people are complaining about and their severity. An appreciation for what people are saying can help you build better products, save valuable customers, and fix problems earlier. In the case of financial service complaints, customer language can also expose what’s known in regulation circles as “unfair, deceptive, or abusive acts or practices” (UDAAP). There were $2.5b in UDAAP settlements in 2014, up 30% from 2013.

In this post we take up one small but revealing aspect of language: intensifiers. There are a lot of ways that people show intensity–in speech, they increase their volume; in text, they may use ALL CAPS or rows of exclamation points. But right now let’s look at words that are traditionally called “intensifiers”–like very and really. Explicit accusations of deception often come with intensifiers–but as is often the case with language, a word that accompanies explicit accusations also helps pinpoint implicit ones. And outside of accusations of deception, intensifiers also help identify highly emotional content.

In daily conversation, people usually use intensifiers about positive things. People talk about really enjoying things and how they are really neat. They say thanks very much and that things are very interesting. That said, people’s everyday speech also has a lot of very important and very difficult. In Spanish speech, the words that usually occur with muy are bien, poco, importante, and difícil. These are common in Portuguese, too–muito (‘very’) also goes with bem (‘well’), importante, and difícil. Regardless of your native language, if you reflect on where intensifiers appear, you’ll see they aren’t just used to intensify verbs and adjectives–they’re used to intensify a speaker/author’s commitment to a claim.

Take a look at how they are used in customer complaints lodged against financial institutions. Looking closely at intensifiers identifies issues with customer service as well as unfair, deceptive, and abusive acts and practices:

Chase’s lack of appropriate and timely processing of my family’s request is literally forcing us into foreclosure but I struggle to keep my mortgage current b/c of the adverse professional ramifications.

Please help me they prey on people that are poor and withouta car I cant work. I have gotten soooo mad and it is not good for my health

I explained to him I want to pay my loan I just can not afford the xxxx withdraws of $240.00 bi-weekly, he was extremely rude and ridiculed me saying he could not help me with anything until my account made it his way

tHey are now out of business filed bankruptcy sold their portfolio to a third party and cant be found. PRO-COLLECT IS ILLEGALLY TRYING TO COLLECT ON ILLEGAL BILLING STATEMENTS THAT ARE TOTALLY FALSE AND WITHOUT MERIT.

Overall, about 30% of complaints against financial firms include intensifiers. Reddit provides an interesting contrast set because it has tens of thousands of forums focused on very different matters. The median percentage of posts-with-intensifiers in Reddit forums is 15%. Only 5% of all Reddit forums have as many intensifiers as complaints about banks–for Reddit, these are highly emotional topics having to do with problems in romantic relationships and debates on religion or gender. In the financial service complaints data, the very highest percentage of intensifiers is in Mortgages–that’s when people are talking about their families losing their homes, so it’s no wonder that it’s so high.
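
Mechanically, this is just lexicon matching. A toy version–the handful of words here is illustrative, not the lexicon behind the 30% figure:

```python
# Toy intensifier check; the real lexicon is much larger.
import re

INTENSIFIERS = re.compile(
    r"\b(very|really|extremely|totally|completely|soo+)\b",
    re.IGNORECASE)

def has_intensifier(text):
    return INTENSIFIERS.search(text) is not None

complaints = ["at best they are completely incompetent",
              "I have gotten soooo mad"]
share = sum(has_intensifier(c) for c in complaints) / len(complaints)
print(share)  # fraction of complaints containing an intensifier
```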

We can get more granular than Mortgages. Across all different kinds of financial products, let’s look at what sort of issues customers use intensifiers with disproportionately:

  • Can’t repay my loan
  • Loan modification, collection, foreclosure
  • Application, originator, mortgage broker
  • Dealing with my lender or servicer
  • Problems when you are unable to pay
  • Problems caused by my funds being low
  • Communication tactics

In other words, people are using intensifiers in highly fraught situations when their homes or possessions are on the line, as well as when they feel like there is problematic communication. This recalls one of the major findings about one-star ratings in Yelp reviews–they are rarely about food; they are about awful service.

For a sense of contrast, here are categories where consumers use fewer intensifiers than we’d expect if everything were just random:

  • Incorrect information on credit report
  • Improper use of credit report
  • Unable to get credit report/credit score
  • Credit reporting company’s investigation

This also means that while people issue complaints to credit bureaus, they don’t use that many intensifiers–so complaints against TransUnion, Experian, and Equifax have low rates of intensifiers. The highest rates of intensifiers in complaints go with companies like Green Tree Servicing, Enhanced Recovery Company, Ocwen, NationStar Mortgage, and Wells Fargo. That pattern makes sense: while bad credit ratings definitely affect people, it’s not as intense an emotional situation as a home being on the line. Automated processes are also seen differently than direct contact with humans (loan officers, etc.).

Intensifiers are just a tiny aspect of assessing risk. Ideally, you want a system that considers all kinds of words and phrases–actually, you want to detect these automatically and give them weights based on the statistical strength of their signal. To learn more about the ways that adaptive machine intelligence works to do this, check out this blog post or our use cases page.

Humans can barely understand emojis. Will machines do any better?

22 Sep

The human skull has 14 facial bones and 35 muscles wrapping around these bones. That anatomy works together to form everything from grimaces, to grins, to mouths agape. Beyond the face, there are all kinds of cues that you can use to understand someone: voice contours, body language, and eye contact, to name a few.

All this context disappears when we switch to text. Emojis and emoticons help fill in the gap. They let us express a stance; for instance, “Ok” can connote “I’m a little bothered,” but “Ok :)” means the situation really is okay. As a special bonus, in addition to some 130 available facial expressions, emojis let us style ourselves into sleepy pandas, sparkle tigers, and thousands of otherwise-impossible contortions.

While plasticity is part of what makes emojis fun to use, it’s also what can make them complex to understand. But, as more communication migrates to digital avenues—think about how often you text versus how often you make a phone call—deciphering our 21st-century shorthand is becoming essential.


Continue reading: check out the full article from Qualcomm!