Which new emoji will be the most popular?

20 Jun

June 21st is the release of Unicode 9, which will feature 72 new emoji–folks at Emojipedia have helpfully put them all together. The question in this blog post is: which ones will turn out to be the most popular? (Note that most people aren’t going to be able to use them immediately–you have to get an update of your phone/browser for them to show up and so will anyone you want to send them to.)


Two emoji that won’t become popular are going to be the rifle and the modern pentathlon since it won’t be easy to access them. In May, Apple led an effort against them, so you almost certainly won’t see them in any keyboard even though I believe the code will be in place.

Using past data to predict the next round

In general, you should bet on hearts, faces, and hand gestures. Here are some screenshots from emojitracker.com and EmojiXpress, which show what’s been most popular on Twitter and SMS text messages, respectively.

Top emoji of all time on Twitter by emojitracker.com (the coloring doesn’t mean anything you need to worry about)

Top emoji of all time from emojiXpress, which represents use in SMS text messages

EmojiXpress also helpfully shows which of the newest emoji have been most popular:

The most popular of the newest emoji used on emojiXpress

Note that there was a pretty big campaign for the taco, but emojiXpress currently has it 21st among the emoji that were released last year. So I don’t think that augurs well for those of you who want to predict bacon. I took a quick look at the usage of taco over the last several days and there’s no upward trajectory; it’s plodding along at the rate it has been for the last several months. If anything in the Unicode 8 emoji is trending, it’s probably the scorpion, but it’s going to take a while to overtake even the chipmunk.

Prediction: Overall

My prediction for the number one overall spot is the ROFL face because there’s a strong tendency for people to use emoji to express happy states of affairs. My I-hope-it’s-not-the-runner-up pick is the black heart.

Screenshot 2016-06-20 13.39.29.png

Prediction: Not-faces

I think it’s likely that the shrug and the face palm are going to have aficionados. And while I like the John Travolta moves on the dancing man, I’d rather live in a world in which we all agree that EVERYONE is a woman-in-a-red-dress 💃 (note that not all platforms show a red dress).

Meanwhile, there’s a cartwheel, which really should be popular, but it’s going to appear in the athletic section so most people will miss it. And they’ll miss water polo, too, which is a shame not just because I used to play but have you seen water polo players?

Even before skin tones were easily available, people using the #blacklivesmatter and #icantbreathe hashtags on Twitter were using a lot of the fist emoji to indicate solidarity and Black Power. Now that skin tones are available people can use hand gestures and other people-emoji that more accurately describe them.

The new batch of hand gestures can be used playfully, positively and politely (handshakes and fist bumps being ways of making contact). I’m not quite sure how having left- and right-facing fist bumps will work. Most emoji face just one way: you have to run and drive off to the right 🏃🚗🚓. It’ll be neat if people offer a fist bump facing right and then get a reply that has a fist bump facing left to connect.

Back to popularity. I want to vote for the shrug, but I’m afraid that the fact that it looks like it’s going to show a woman’s head by default means that a lot of people won’t use it. But like the dancing woman, we should all use it. The smart money is likely on the raised hand since it can mean so much (stop, high five, etc). But I’m going to wager that folks using and making fun of selfies are going to cause the selfie emoji to take off:

Screenshot 2016-06-20 15.05.49.png

Prediction: Animals

Finally, as much as I like the gorilla, it’s probably not going to win the animal bracket. Animals are an interesting class because they are all nouns. They offer further evidence that emoji aren’t really about substituting for nouns. Instead emoji are usually about emotional stance, identity, and metaphor. The most popular animals include the see-no-evil monkey, which people don’t use to talk about the actual animals 🙈. The cat-faces with heart eyes or tears are also popular 😻. The unicorn is also very popular among the new emoji–and it doesn’t even really exist…but it can convey sparkle magic.

Despite my warning about treating emoji as if they are just noun-pictures, let’s look at how they are used as nouns. For example, the fox is very commonly talked and written about–at least in American English over the last five years. But that’s mainly because of Fox News. My plea: do not use the fox emoji to refer to this organization.

Search term | Per million words
a fox | 2.84
a deer | 2.82
a shark | 1.81
a bat (not disambiguated) | 1.69
a duck | 1.64
a butterfly | 1.37
an eagle | 1.24
an owl | 0.85
a gorilla | 0.28
a shrimp | 0.28
a lizard/gecko/newt | 0.25
a rhino/rhinoceros | 0.20
a squid | 0.17

Lots of people who aren’t Americans or English-speakers use emoji, of course. The top new animal emoji for Spanish-speakers may be the butterfly, the duck and the fox. For Portuguese speakers, it may be the lizard, the butterfly, the shark, and the eagle. But properly, I should use bigger corpora on those languages and add at least a half dozen more. Even better would be to look at image search results to see which animals people are searching for. 

If this post could put me in touch with Joan Embery, that would be swell. But until she weighs in, I’ll make the claim that butterflies are also the most likely of these animals for people to encounter worldwide since they seem to be distributed everywhere in the world except Antarctica. But I am skeptical of using in-life frequency to predict emoji frequency. So it is because butterflies are widespread symbols of natural beauty that I’m going to pick them as my Most Likely to Succeed in the animal bracket.

Screenshot 2016-06-20 15.37.46

What do you predict?

I think the other two interesting brackets are probably sports and food. Which ones are you picking?

Artificial intelligence in the press and in history!

16 Jun

Over on CrowdFlower’s blog, I’ve got two posts on artificial intelligence.

  • How does the media cover AI?
    • In which I look at 2,000 articles over the last year and a half and talk about the major themes.
  • An AI Springtime
    • Where I take a look at whether we’re just in hyper-hype and how past hype/bust cycles have worked. Also I get to count volcano eruptions and tokens of “disappoint”.

The major themes in recent press about artificial intelligence

Poetry v. not-poetry

5 Jun

I’ve been training an artificial intelligence system to write poetry and this morning I got interested in what the little parts of syntax and semantics are that preoccupy poets compared to other forms of written language. So I took a heap of poetry and a heap of not-poetry, pulled out the bigrams (two-word phrases) and did some statistics to see what distinguishes poetic writing from non-poetic writing.

Poets are preoccupied by these phrases:

  • Metaphor:
    • like a
    • like the
  • Nature:
    • the sky
    • the sun
    • the wind
    • the moon
    • the earth
    • the dark
    • the river
    • the sea
    • the snow
    • a stone
    • the water
    • the air
  • Self and others:
    • said geryon (this is because there’s a fair amount of Anne Carson in the data and she has a whole book about a character named Geryon)
    • the world 
    • my mother
    • the dead
    • my heart
    • of your
  • Space/time/prepositional phrases:
    • the present
    • in the
    • on my
    • in your
    • in its
    • under the
    • from the
    • in a

Meanwhile, they steer clear of the following phrases, which seem to be better for writing letters, fiction, essays or other kinds of non-fiction. The point of this comparison set was not so much to compare poetry to any particular other genre, but to collect a variety of “not-poetry” to see what poets tend not to use:

  • she had
  • he had
  • had been
  • she was
  • she said
  • it was
  • did not
  • going to
  • that he
  • that she
  • i had
  • there was
  • to her
  • seemed to
  • to do
  • he was
  • (okay, I’m going to stop there)

In other words, poets–or at least these poets–don’t tend to talk much about the past tense. They do, however, orient things spatially, as evidenced by all those prepositional phrases. Not surprisingly, the poetry is a lot more personal, with many more I/my/we/you/each other. And of course like a and like the are prominent, since it’s hard to resist a metaphor.

Notice that there are a lot of definite articles used in poetry. I think that’s mostly in service of talking about nature and poetical things, though there are twice as many “the” phrases in the poetry camp as in the non-poetry camp. Those that are non-poetic and reasonably frequent in both lists are the most, the hospital, the baby, the american, the time, the fact, and the country.

If you’re curious about some major phrases where there is no difference, here’s a sample: “and a”, “against the”, “as it”, and “you do” are all examples of phrases that are used about equally in poetic and non-poetic writing.

Methods and data

I took 75,678 lines of poetry (537,711 words) and compared them to 15,000 paragraphs of fiction and non-fiction (534,723 words). If you’re curious about methodology, you can read more about it here and here.
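If you want to try this kind of comparison yourself, here is a minimal Python sketch. It is an illustration under assumptions, not the exact statistics behind the lists above: the tokenizer, the add-one smoothing, and the function names (`bigrams`, `log_ratio`) are invented for this example, and the two "corpora" are toy strings.

```python
import math
import re
from collections import Counter

def bigrams(text):
    """Lowercase, split into word tokens, and return adjacent word pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def log_ratio(corpus_a, corpus_b, smoothing=1.0):
    """Smoothed log ratio of each bigram's relative frequency in A vs. B.

    Positive scores lean toward corpus A, negative scores toward corpus B.
    """
    counts_a, counts_b = Counter(bigrams(corpus_a)), Counter(bigrams(corpus_b))
    vocab = set(counts_a) | set(counts_b)
    # Add-one style smoothing so unseen bigrams don't produce log(0).
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    return {
        bg: math.log((counts_a[bg] + smoothing) / total_a)
        - math.log((counts_b[bg] + smoothing) / total_b)
        for bg in vocab
    }

# Toy stand-ins for the poetry and not-poetry samples (hypothetical text).
poetry = "the moon is like a stone in the river under the sky"
prose = "she had been there and she said it was going to rain"

scores = log_ratio(poetry, prose)
most_poetic = sorted(scores, key=scores.get, reverse=True)[:3]
```

With real data you would want far more text per side and a statistic that accounts for how noisy rare bigrams are, which is why a plain ratio like this one is only a starting point.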

The poetry sample is 37 texts from 35 authors. By word count, the top authors in the data here are Lorine Niedecker (12.8%), Wisława Szymborska (8.7%), Jane Shore (6.1%), and Anne Carson (5.5%). Szymborska wrote in Polish but here I’ve included her in English–so you may want to say I’ve included her and/or the translators, Stanislaw Baranczak and Clare Cavanagh.

The non-poetry is randomly sampled “lines” (paragraphs) from 41 texts by 32 authors. The biggest amounts come from Joan Didion (12%), Virginia Woolf (7.7%), Penelope Fitzgerald (6.7%), Rainbow Rowell (5.3%), and Louisa May Alcott (5.2%).

What about women who aren’t white?

The authors in the data are all white women. You will be shocked–shocked!–to hear that it’s harder to get collections of poetry by non-white poets who are women. I currently only have eight poetry collections that fit. You’ll grant me that it would be strange to include Phillis Wheatley who was writing in the 1700s with a bunch of much-more modern writers. But if you think I should go ahead and add in people like Claudia Rankine, Nikki Giovanni, and Maya Angelou, I’m certainly open to that critique.

There’s a lot more data available for non-white novelists who are women, so that’s probably the next step. EXCEPT that one wants to be careful about what they’re lumping. So I’m unlikely to compare novelists in terms of race unless I have A LOT more data.

That said, diving into the phrases that preoccupy, say, Toni Morrison or Octavia E. Butler compared to other people (or each other!) has some appeal. If you have particular interests or suggestions, I’d be very happy to get them.

A peek inside text message patterns

23 Apr

Is texting ruining your relationships or saving them?

It’s a bit surprising to me when people think technology is inherently good or bad. Even people are rarely good or bad as far as I can tell, though yes they sometimes work wonders and often they do monstrous things. I bet you could really manipulate people with SMS. Or you could be warm and loving.

Texting is a huge and important phenomenon–every three months the words sent by SMS equal the number of words ever published in books in the history of humanity (those are the 2013 stats so it’s almost certainly accelerated). But texting is hard to study, so I thought I’d go through my data–the names below are pseudonyms.

In particular, I’m interested in how people stay in touch. When you are meeting with someone face-to-face or even over the phone, context is a lot richer. If you’re together, you can see each other’s faces and body language. You can have joint attention on a stranger passing by or a piece of art on a wall in front of you. Even when you’re on the phone, you hear lots of vocal cues. Emoticons, emoji, animated gifs, shared YouTube videos and the like help add color and context but they aren’t quite the same.

But texting can work asynchronously so even if you aren’t in the same city or can’t pick up the phone you can let someone know that you’re thinking of them. This will be a very basic post, mainly laying out some differences between different individuals and different groups.

About the data

From April 1, 2012 to April 23, 2016 I received 77,271 text messages and sent out 57,186. That represents 924 different people (well, mostly people; I didn’t filter out automated messages but there aren’t very many of these). 226 of these people have sent me more than 50 messages.

I typically send most of my text messages from 11am to 7pm (the median hour for me is 4pm). This is the same as when people texting me are most active.

My own texts tend to be 7 words/39 characters (top quartile of 14 words, 75 characters; bottom quartile of 3 words, 19 characters). Other people are just slightly less verbose: a median of 6 words/33 characters for people texting me (top quartile of 12 words, 65 characters; bottom quartile of 3 words, 15 characters).
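As a rough sketch of how numbers like these can be computed, here is a small Python example using the standard library. The function name `length_summary` and the sample messages are made up for illustration, and the "inclusive" quartile method is one choice among several, not necessarily the one used for the figures above.

```python
import statistics

def length_summary(messages):
    """Return (bottom quartile, median, top quartile) of words per message."""
    word_counts = [len(m.split()) for m in messages]
    q1, q2, q3 = statistics.quantiles(word_counts, n=4, method="inclusive")
    return q1, q2, q3

# Made-up messages, not taken from the real data set.
sample = [
    "on bike",
    "heading your way",
    "at the cafe now",
    "running ten minutes late sorry",
    "do you want to get dinner later tonight",
]
summary = length_summary(sample)  # (3.0, 4.0, 5.0) for this sample
```

Swapping `len(m.split())` for `len(m)` gives the character-based version of the same summary.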

Looking at word counts, people are at their wordiest during the day until about 4pm and at their least wordy from 10pm on. The time between 4pm and 10pm falls in between these extremes.

Defining contrast sets

Looking over the people I communicate with via text, these groups emerge:

  • The top 10 people I text with are undeniably among the people I’ve been closest with over this time period
  • Among the next 50 people (all have sent over 200 text messages), there are
    • Friends/family who I’m really close to but who don’t text much (they are more phone/in-person people)
    • Romantic interests (a big but fuzzy category since many of these become friends–a bit over half?)
    • Other friends
    • Work folks (we mostly use the Slack chat tool to communicate, not SMS)
  • These categories repeat among the “under 200 text message” senders, though I also start to see people best categorized as “acquaintances”. That’s not an appropriate way to talk about any of the people who have sent me more than 200 messages.

Many of the people I correspond with the most are also in group threads, which means it’s a lot easier to count texts from them to me than from me to them. So for the moment, I’m going to focus on what’s incoming.

In general, these particular numbers don’t do a lot to differentiate people. That is, the prime texting time for everyone is basically in the late afternoon, especially if we consider time zones. The more romantic of relationships do skew a bit to the later side.

Perhaps the main thing to note is that you’ve got some very wordy people like Amber and Jezebel–who are trained therapists for what it’s worth–and some less wordy people like Hannibal who loves emoji and Rusty who is known as laconic in real life, too.

Another note about Rusty, Hannibal, and Sigmund: these are among the people I hang out with the most, so many of these texts are also coordination messages, which don’t usually need to be very long; they often use phrases like at <location>, on bike, and heading your way.

A few other notes:

  • Apparently romantic relationships require fewer words–they tend to be driving to real-life contact, so perhaps that is what’s going on.
  • The super-close friends who don’t really seem to like texting have fewer texts (by definition) but they are also a bit shorter. The people in this category tend not to be where I live–most of them live out of state and one travels a lot.
  • “Other friends”–a group of people who send a fair number of texts (200+) but who, qualitatively, I just wouldn’t put quite in the same level as a Hannibal or Rusty or Sigmund, seem to be a bit wordier. This may suggest that more catching up is happening in text.
  • Qualitatively, the “other friend” group does not particularly use other means to stay in touch–these folks do not use, say, Facebook likes, to have another point of contact.
  • Sometimes people talk about breadcrumbing as a way of lightly staying in contact. My sense is that most of my Facebook likes come from people who I text with A LOT or people I rarely/never text with. That middle-group of “we text some” doesn’t use pocket dials or Facebook likes or retweets to say, “Hey, I’m paying light attention to you.”

More soon!

Name | Texts to me | Median hour sent | Time range | Median word count | Word count range
Bobo | 9,713 | 3pm | 11am to 7pm | 7 | 3 to 13 words
Hannibal | 5,858 | 5pm | 1pm to 8pm | 5 | 2 to 11 words
Rusty | 4,529 | 4pm | noon to 7pm | 5 | 2 to 10 words
Amber | 2,800 | 3pm | 11am to 5pm | 12 | 6 to 25 words
Jezebel | 2,043 | 4pm | 10am to 5pm (she lives in CST) | 9 | 5 to 19 words
Swayze | 2,042 | 6pm | noon to 9pm | 7 | 3 to 13 words
Sigmund | 2,030 | 4pm | noon to 7pm | 6 | 2 to 12 words
Falcon | 1,411 | 2pm | 10am to 6pm (she lives in CST) | 7 | 3 to 14 words
Allistar | 1,356 | 3pm | 11am to 7pm | 6 | 2 to 12 words
Urs | 1,157 | 3pm | 11am to 6pm | 7 | 3 to 13 words
Low-texting super-close people (200+ texts) | 1,865 | 2pm | 11am to 6pm | 7 | 3 to 13 words
Other friends (200+ texts) | 5,479 | 3pm | 11am to 7pm | 8 | 4 to 15 words
Other romantic (200+ texts) | 10,152 | 4pm | noon to 8pm | 5 | 2 to 9 words
Everyone else (50-200 text messages) | 16,248 | 3pm | 11am to 7pm | 6 | 3 to 12 words



Computing with affective lexicons: Dan Jurafsky at the Bay Area NLP Meetup

2 Feb

First things first: Dating advice

When you collect the right data and get it labeled, you can find really interesting things. For example, record over a thousand 4-minute speed dates and have the participants rate each interaction for how flirtatious they were and how flirtatious their partner was. Key finding: You thinking that someone is flirting doesn’t correlate to them saying they are flirting. It correlates with YOU saying that YOU are flirting.

Getting to meaning

The world is full of concrete things like bananas and bathrobes. But people do a lot more than just refer to objects in the world when they are speaking and writing and speed dating. They express emotions, they draw in their audiences or push them away, they share opinions about things and ideas in the world.

The most recent NLP Meetup was a tutorial about this by Stanford computational linguist (and Idibon advisor), Dan Jurafsky. You should go look at the slides, but it covered three useful areas: 1) how to think about affective stances, 2) which resources are out there ready for use, and 3) how to use computational linguistics methods to develop new resources, with examples from restaurant reviews and speed dating. Dan’s got a book on the language of food you should go check out—our post on it includes a video of him discussing those findings and methods in detail.


Emotion detection

There are lots of signals that convey emotion—speech rate, facial expressions, pause length, darting eyes. Like Dan’s talk, I’ll focus on words.

The first set of resources Dan presented were lists of words and phrases along the simple dimensions of positive/negative/neutral. There are fewer resources that do that by emotion.

Psychologists often talk about there being six basic emotions—Pixar even made them into the characters for Inside Out. But in practice, psychologists have had a hard time keeping the list that small. In a review of 21 different theories of basic emotions a few years ago, I counted 51 different emotions posited as basic. The average number of basic emotions posited across the 21 studies is 9. Here are the most common:

  • Anger/rage/hostility (18)
  • Fear/fright/terror (17)
  • Joy/happiness/elation/enjoyment (14)
  • Sadness/sorrow/distress/dejection (14)
  • Disgust (12)
  • Shame (9)
  • Love/tender emotion (8)
  • Anxiety/worry (7)
  • Surprise (7)
  • Guilt (6)

There are definitely cases where emotions are more actionable than plain positive/negative/neutral sentiment. But even if there are certain fundamental emotions, they won’t be useful if they are unrepresented in the contexts you care about. For example, a customer care center can make use of emotion detection, but the emotional universe of customer support is really more of disappointment vs. resignation vs. cold anger vs. hot fury vs. relief. Understanding if someone is actually full of joy by the end of a call is useful, but it doesn’t help in routing in the way that understanding level-of-irritation can.

This connects nicely with a theme in Dan’s talk—building statistical models that are based on the categories you care about. (If you want to know about Idibon’s take on this, check out our blog post on adaptive learning.)

Priors: The most important method in the talk

We know that conservatives and progressives talk about issues differently. But a lot of statistical methods for distinguishing Group A from Group B result in uninformative lists, either dominated by frequent words (the and to tell you a little but not much) or very rare words (how informative is it if we say argle-bargle is a conservative word just because Antonin Scalia used it once?).

Computational linguistics comes down to a lot of counting. A “prior” is something that lets you incorporate background information. Check out Dan’s slides 62 and 63, but don’t let the equations frighten you. Mark Liberman’s post on Obama’s language may also help.

Like Dan and Mark, one of my favorite papers on priors is Monroe, Colaresi and Quinn (2008). It’s useful because it walks through so many (inferior) methods before presenting priors and what they get you. For example, they show that in reproductive rights debates in Congress, Democratic representatives disproportionately use women, woman, right, decision, her, doctor, while Republicans use words like babies, kill, procedure, abort, and mother. These very different framings make sense and are interpretable. Other results aren’t nearly as clean.
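To make the idea a bit more concrete, here is a hedged Python sketch in the spirit of the log-odds-with-a-prior approach from that paper, as I understand it. The function name `log_odds_with_prior`, the `prior_strength` parameter, and the choice to build the prior from the pooled counts are assumptions for this example, and the word counts are invented rather than taken from any real debate.

```python
import math
from collections import Counter

def log_odds_with_prior(counts_a, counts_b, prior_strength=10.0):
    """z-scored log-odds ratio of each word in group A vs. group B,
    smoothed toward a shared prior built from the pooled counts."""
    pooled = counts_a + counts_b
    pooled_total = sum(pooled.values())
    # Prior pseudo-counts: pooled word frequencies scaled to prior_strength.
    alpha = {w: prior_strength * c / pooled_total for w, c in pooled.items()}
    alpha_0 = sum(alpha.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())

    z_scores = {}
    for w, a_w in alpha.items():
        y_a, y_b = counts_a[w], counts_b[w]
        log_odds_a = math.log((y_a + a_w) / (n_a + alpha_0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + alpha_0 - y_b - a_w))
        # Approximate variance of the difference in log-odds.
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        z_scores[w] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z_scores

# Invented toy counts, only to show the shape of the input.
group_a = Counter({"the": 500, "women": 40, "procedure": 5})
group_b = Counter({"the": 500, "women": 5, "procedure": 40})

z = log_odds_with_prior(group_a, group_b)
```

In this toy setup the frequent but evenly used word gets a score of zero, while the words that really differ between the groups get large positive or negative scores. That is the behavior that keeps the resulting lists from being dominated by either very frequent or very rare words.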

For more background on priors, you might want to check out the current draft of the update for Jurafsky & Martin’s Speech and Language Processing, which is one of the most widely used textbooks for NLP. Check out the chapter on classification and the one on building affective lexicons.

A few other tricks

Dan mentioned a few techniques that are simple and surprisingly effective. For example, building a system that really understands negation is very difficult. But you get a long way by just detecting linguistic negation (not, n’t, never) and then doing something that amounts to flipping any positive words that follow, up to a comma or period. This method will get you the right classification for something like It’s not a hilarious or thrilling film, but it is majestic. You detect that not is a negative word and therefore treat hilarious and thrilling as negative, too. Otherwise you’d think this mostly negative sentence was mostly positive.
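Here is a minimal Python sketch of that negation trick. The tiny `LEXICON`, the `NEGATORS` set, and the function name `score_with_negation` are hypothetical, and a real system would use a proper sentiment lexicon and tokenizer.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"hilarious": 1, "thrilling": 1, "majestic": 1, "dull": -1}
NEGATORS = {"not", "never", "no"}
PUNCTUATION = set(".,;!?")

def score_with_negation(text):
    """Return (word, polarity) pairs, flipping polarity after a negation
    word until the next punctuation mark."""
    # Normalize curly apostrophes so "isn’t" and "isn't" are treated alike.
    normalized = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+|[.,;!?]", normalized)
    negated = False
    scored = []
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False  # the negation scope ends at punctuation
        elif tok in NEGATORS or tok.endswith("n't"):
            negated = True
        elif tok in LEXICON:
            polarity = LEXICON[tok]
            scored.append((tok, -polarity if negated else polarity))
    return scored

scores = score_with_negation(
    "It's not a hilarious or thrilling film, but it is majestic."
)
# scores == [("hilarious", -1), ("thrilling", -1), ("majestic", 1)]
```

Summing the polarities gives -1 for the example sentence, where a lexicon lookup without the flip would have given +3.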

Another clever technique comes from Peter Turney, who wanted to get the semantic orientation of phrases in reviews of cars, banks, movies, and travel destinations: what phrases point to a thumbs up versus a thumbs down?

Knowing that things like “preposition + article” (on the, at a) don’t do much affective work, he came up with 5 part-of-speech patterns that basically gave him meaningful adjectives or adverbs with a bit of context (see Dan’s slide 41).

Every heuristic has its blindspots and certainly phrases like mic drop, swipe right, and on fleek are opinion-bearing phrases. Because of their parts-of-speech, these would be excluded from Turney’s calculations. But the Turney algorithm does find things like very handy and lesser evil. And as Dan said, “The recovery of virtual monopoly as negative is pretty cool.”

Getting at what matters

How many retweets or Facebook likes will Super Bowl campaigns get? How many minutes was a movie, how much did an appetizer cost, how tall was your date?

These are all, arguably, objective facts that you can measure. But there is another way of thinking about them: they are entirely superficial aspects—and therefore not even really about the ad/movie/restaurant/date at all.

By contrast, how did sexism get combatted in last year’s Always #likeagirl campaign? A feeling of wonder in a movie, the awfulness of service, the engagement from your date: these are subjective matters but they can be assessed and reveal deeper insights than the surface facts that are easiest to count. Natural Language Processing is still about counting, but it opens the possibility to count what counts.

Conspiracy, complaints, and fraud: The language of reasons

10 Nov

Three separate threads have been whirling around my head for the last few months, so I was glad to have the opportunity to connect them a few weeks ago at UC Merced.

Thread #1: Fraud

Fraud is a big deal–the Association of Certified Fraud Examiners places the amount of global fraud loss at $3.7 trillion per year.

If you want to detect fraud, you can’t just look for people writing, “I am committing fraud”. Instead, you look for evidence of the fraud diamond: opportunity, pressure, capability, and the focus of my talk–rationalization.

But one of the things that I’ve been thinking about is: how do people rationalize? That is to say, how do they give reasons to themselves and others to make something okay? I like Karen Horney’s words: “Rationalization may be defined as self-deception by reasoning.”

Thread #2: Customer Complaints

Last week, I wrote a bit about how people use intensifiers when they are filing complaints. Another thing that is prominent in complaint-giving is reasoning. 25% of customer complaints logged with the Consumer Financial Protection Bureau have the word because in them. Here’s an example of the basic structure of because–in English, you can swap the order, but in both speech and writing, people almost always put the result before the cause:

  • Result: We strongly suggest someone look into Citimortgage’s business practices,
  • Cause: because at best they are completely incompetent, and at worst they are committing acts of fraud

In these narratives of what happened, people give reasons for their actions and feelings, but they also attribute reasons to banks and other financial institutions. Reason-giving is bound up in explaining the ways in which customers have been affected and how things should be remedied.

Thread #3: Conspiracy Theorists

Okay, this one is mostly in here because it’s fun.

Towards the end of the summer, two Idibonites started looking at what it is, linguistically, that makes people sound rational versus paranoid. We’re not ready to release our “statistical model of paranoia” yet, but one of the things Jana and Charissa have found has to do with how people give reasons. About 7.8% of /r/conspiracy posts have the word because in them. In the previous section, I noted that consumer complaints about banks had a rate of 25%. So 7.8% is a lot less than that–but if you look across all the Reddit forums, the rate of because in /r/conspiracy puts it in the top quartile of most-because-y. (See below for the ones that get up to 16%.)
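Rates like these are simple to compute. Here is a small Python sketch, where the function name `share_containing` and the sample posts are made up for illustration:

```python
import re

def share_containing(posts, word="because"):
    """Fraction of posts that contain `word` as a whole word (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = sum(1 for post in posts if pattern.search(post))
    return hits / len(posts) if posts else 0.0

# Made-up posts for illustration only.
posts = [
    "I closed the account because the fees kept changing.",
    "They never called me back.",
    "Because of the delay, my payment was late.",
    "Please look into this.",
]
rate = share_containing(posts)  # 0.5 for these four posts
```

Running the same function over each forum separately gives the per-forum rates that can then be ranked against each other.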

Some favorite findings

You can watch the presentation or flip through the slides, but here are probably my favorite points.

  • When a customer complaint about a bank involves a “because”, it’s a much longer complaint. This seems to also be a feature within Reddit.
  • Because is associated with highly emotional content in many domains—ranging from soap opera dialog to speeches in the British Parliament.  Reasoning isn’t separate from emotion, it’s built on it.
  • Becauses are much more common in conversations about accounts and mortgages than credit reporting or debt collection.
  • The subreddits where people give the most reasons (highest percentages of because) include those that are specifically about debating (/r/changemyview, /r/DebateaChristian), those that tackle gender and sexism (/r/AgainstGamerGate), and those that have to do with romance (/r/relationship_advice, /r/relationships).
  • Among /r/conspiracy authors, the biggest because users tend to talk about JFK, 9/11, aliens, and space.

Intensity in consumer complaints about banks

30 Oct

Analyzing the language used in consumer complaints tells you about both the topics that people are complaining about and their severity. An appreciation for what people are saying can help you build better products, save valuable customers, and fix problems earlier. In the case of financial service complaints, customer language can also expose what’s known in regulation circles as “unfair, deceptive, or abusive acts or practices” (UDAAP). There were $2.5b in UDAAP settlements in 2014, up 30% from 2013.

In this post we take one small but revealing aspect of language: intensifiers. There are a lot of ways that people show intensity–in speech, they increase their volume, in text they may use ALL CAPS or rows of exclamation points. But right now let’s look at words that are traditionally called “intensifiers”–like very and really. Explicit accusations of deception often come with intensifiers–but as is often the case with language, a word that accompanies explicit accusations also helps pinpoint implicit ones. And outside of accusations of deception, intensifiers also help identify highly emotional content.

In daily conversation, people usually use intensifiers about positive things. People talk about really enjoying things and how they are really neat. They say thanks very much and that things are very interesting. That said, people’s everyday speech also has a lot of very important and very difficult. In Spanish speech, the words that usually occur with muy are bien, poco, importante, and difícil. These are common in Portuguese, too–muito (‘very’) also goes with bem (‘well’), importante, and difícil. Regardless of your native language, if you reflect on where  intensifiers appear, you’ll see they aren’t just used to intensify verbs and adjectives–they’re used to intensify a speaker/author’s commitment to a claim.

Take a look at how they are used in customer complaints lodged against financial institutions. Looking closely at intensifiers identifies issues with customer service as well as unfair, deceptive, and abusive acts and practices:

Chase’s lack of appropriate and timely processing of my family’s request is literally forcing us into foreclosure but I struggle to keep my mortgage current b/c of the adverse professional ramifications.

Please help me they prey on people that are poor and withouta car I cant work. I have gotten soooo mad and it is not good for my health

I explained to him I want to pay my loan I just can not afford the xxxx withdraws of $240.00 bi-weekly, he was extremely rude and ridiculed me saying he could not help me with anything until my account made it his way

tHey are now out of business filed bankruptcy sold their portfolio to a third party and cant be found. PRO-COLLECT IS ILLEGALLY TRYING TO COLLECT ON ILLEGAL BILLING STATEMENTS THAT ARE TOTALLY FALSE AND WITHOUT MERIT.

Overall, about 30% of complaints against financial firms include intensifiers. Reddit provides an interesting contrast set because it has tens of thousands of forums focused on very different matters. The median percentage of posts-with-intensifiers in Reddit forums is 15%. Only 5% of all Reddit forums have as many intensifiers as complaints about banks–for Reddit, these are highly emotional topics having to do with problems in romantic relationships and debates on religion or gender. In the financial service complaints data, the very highest percentage of intensifiers is in Mortgages–that’s when people are talking about their families losing their homes, so it’s no wonder that it’s so high.

We can get more granular than Mortgages. Across all different kinds of financial products, let’s look at what sort of issues customers use intensifiers with disproportionately:

  • Can’t repay my loan
  • Loan modification, collection, foreclosure
  • Application, originator, mortgage broker
  • Dealing with my lender or servicer
  • Problems when you are unable to pay
  • Problems caused by my funds being low
  • Communication tactics

In other words, people are using intensifiers in highly-fraught situations when their homes or possessions are on the line, as well as when they feel like there is problematic communication. This recalls one of the major findings about one-star ratings in Yelp reviews–they are rarely about food, they are about awful service.

For a sense of contrast, here are categories where consumers use fewer intensifiers than we’d expect if everything were just random:

  • Incorrect information on credit report
  • Improper use of credit report
  • Unable to get credit report/credit score
  • Credit reporting company’s investigation

This also means that while people issue complaints to credit bureaus, they don’t use that many intensifiers–so complaints against TransUnion, Experian, and Equifax have low rates of intensifiers. The highest rates of intensifiers in complaints go with companies like Green Tree Servicing, Enhanced Recovery Company, Ocwen, NationStar Mortgage, and Wells Fargo. That’s particularly because, while bad credit ratings definitely affect people, it’s not as intense an emotional situation as a home being on the line. Automated processes are also seen differently than direct contact with humans (loan officers, etc).
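One simple way to ask whether a category has more or fewer intensifiers than expected is to divide its rate by the overall rate. The Python sketch below does that under stated assumptions: the short `INTENSIFIERS` pattern, the function name `intensifier_lift`, and the four sample complaints are all hypothetical, and this is not necessarily the statistic used for the lists above.

```python
import re
from collections import defaultdict

# A small, hypothetical intensifier list; a real one would be much longer.
INTENSIFIERS = re.compile(
    r"\b(?:very|really|extremely|totally|literally|soo+)\b", re.IGNORECASE
)

def intensifier_lift(complaints):
    """Map each category to (its intensifier rate) / (overall rate).

    `complaints` is a list of (category, text) pairs. Values above 1 mean
    more intensifiers than the overall rate would predict; below 1, fewer.
    """
    if not complaints:
        return {}
    totals, hits = defaultdict(int), defaultdict(int)
    for category, text in complaints:
        totals[category] += 1
        if INTENSIFIERS.search(text):
            hits[category] += 1
    overall = sum(hits.values()) / sum(totals.values())
    if overall == 0:
        return {cat: 0.0 for cat in totals}
    return {cat: (hits[cat] / totals[cat]) / overall for cat in totals}

# Made-up complaints for illustration only.
complaints = [
    ("Mortgage", "They were extremely rude about the foreclosure."),
    ("Mortgage", "I am really worried about losing my home."),
    ("Credit reporting", "There is incorrect information on my report."),
    ("Credit reporting", "This error is very frustrating."),
]
lift = intensifier_lift(complaints)
```

With only a handful of complaints per category a ratio like this is very noisy, so in practice you would also want a significance test or a prior before trusting the ranking.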

Intensifiers are just a tiny aspect of assessing risk. Ideally, you want a system that considers all kinds of words and phrases–actually, you want to detect these automatically and give them weights based on the statistical strength of their signal. To learn more about the ways that adaptive machine intelligence works to do this, check out this blog post or our use cases page.

