Conspiracy, complaints, and fraud: The language of reasons

10 Nov

Three separate threads have been whirling around my head for the last few months, so I was glad to have the opportunity to connect them a few weeks ago at UC Merced.

Thread #1: Fraud

Fraud is a big deal–the Association of Certified Fraud Examiners places the amount of global fraud loss at $3.7 trillion per year.

If you want to detect fraud, you can’t just look for people writing, “I am committing fraud”. Instead, you look for evidence of the fraud diamond: opportunity, pressure, capability, and the focus of my talk–rationalization.

But one of the things that I’ve been thinking about is: how do people rationalize? That is to say, how do they give reasons to themselves and others to make something okay? I like Karen Horney’s words: “Rationalization may be defined as self-deception by reasoning.”

Thread #2: Customer Complaints

Last week, I wrote a bit about how people use intensifiers when they are filing complaints. Another thing that is prominent in complaint-giving is reasoning. 25% of customer complaints logged with the Consumer Financial Protection Bureau have the word because in them. Here’s an example of the basic structure of because–in English, you can swap the order, but in both speech and writing, people almost always put the result before the cause:

  • Result: We strongly suggest someone look into Citimortgage’s business practices,
  • Cause: because at best they are completely incompetent, and at worst they are committing acts of fraud

In these narratives of what happened, people give reasons for their actions and feelings, but they also attribute reasons to banks and other financial institutions. Reason-giving is bound up in explaining the ways in which customers have been affected and how things should be remedied.
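For anyone who wants to try this kind of count on their own text, here is a minimal sketch of how a "percentage of documents containing because" figure could be computed. It is not the code behind the numbers above; the function name because_rate and the sample sentences are made up for illustration.

```python
import re

# Whole-word, case-insensitive match, so "Because" at the start of a
# sentence counts too.
BECAUSE = re.compile(r"\bbecause\b", re.IGNORECASE)

def because_rate(texts):
    """Return the fraction of texts that contain 'because' at least once."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if BECAUSE.search(text))
    return hits / len(texts)

# Invented examples, not real CFPB complaints.
sample = [
    "I closed the account because the fees kept changing.",
    "They never returned my calls.",
    "Because of the delay, I was charged twice.",
    "Please fix my statement.",
]
print(because_rate(sample))
```

Counting documents rather than total occurrences keeps one very long, very because-heavy complaint from dominating the rate.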

Thread #3: Conspiracy Theorists

Okay, this one is mostly in here because it’s fun.

Towards the end of the summer, two Idibonites started looking at what it is, linguistically, that makes people sound rational versus paranoid. We’re not ready to release our “statistical model of paranoia” yet, but one of the things Jana and Charissa have found has to do with how people give reasons. About 7.8% of /r/conspiracy posts have the word because in them. In the previous section, I noted that consumer complaints about banks had a rate of 25%. So 7.8% is a lot less than that–but if you look across all the Reddit forums, the rate of because in /r/conspiracy puts it in the top quartile of most-because-y. (See below for the ones that get up to 16%.)

Some favorite findings

You can watch the presentation or flip through the slides, but here are probably my favorite points.

  • When a customer complaint about a bank involves a “because”, it’s a much longer complaint. This seems to also be a feature within Reddit.
  • Because is associated with highly emotional content in many domains, ranging from soap opera dialog to speeches in the British Parliament. Reasoning isn’t separate from emotion; it’s built on it.
  • Becauses are much more common in conversations about accounts and mortgages than credit reporting or debt collection.
  • The subreddits where people give the most reasons (highest percentages of because) include those that are specifically about debating (/r/changemyview, /r/DebateaChristian), those that tackle gender and sexism (/r/AgainstGamerGate), and those that have to do with romance (/r/relationship_advice, /r/relationships).
  • Among /r/conspiracy authors, the biggest because users tend to talk about JFK, 9/11, aliens, and space.


Intensity in consumer complaints about banks

30 Oct

Analyzing the language used in consumer complaints tells you about both the topics that people are complaining about and their severity. An appreciation for what people are saying can help you build better products, save valuable customers, and fix problems earlier. In the case of financial service complaints, customer language can also expose what’s known in regulation circles as “unfair, deceptive, or abusive acts or practices” (UDAAP). There were $2.5b in UDAAP settlements in 2014, up 30% from 2013.

In this post we take one small but revealing aspect of language: intensifiers. There are a lot of ways that people show intensity–in speech, they increase their volume, in text they may use ALL CAPS or rows of exclamation points. But right now let’s look at words that are traditionally called “intensifiers”–like very and really. Explicit accusations of deception often come with intensifiers–but as is often the case with language, a word that accompanies explicit accusations also helps pinpoint implicit ones. And outside of accusations of deception, intensifiers also help identify highly emotional content.

In daily conversation, people usually use intensifiers about positive things. People talk about really enjoying things and how they are really neat. They say thanks very much and that things are very interesting. That said, people’s everyday speech also has a lot of very important and very difficult. In Spanish speech, the words that usually occur with muy are bien, poco, importante, and difícil. These are common in Portuguese, too–muito (‘very’) also goes with bem (‘well’), importante, and difícil. Regardless of your native language, if you reflect on where intensifiers appear, you’ll see they aren’t just used to intensify verbs and adjectives–they’re used to intensify a speaker/author’s commitment to a claim.

Take a look at how they are used in customer complaints lodged against financial institutions. Looking closely at intensifiers identifies issues with customer service as well as unfair, deceptive, and abusive acts and practices:

Chase’s lack of appropriate and timely processing of my family’s request is literally forcing us into foreclosure but I struggle to keep my mortgage current b/c of the adverse professional ramifications.

Please help me they prey on people that are poor and withouta car I cant work. I have gotten soooo mad and it is not good for my health

I explained to him I want to pay my loan I just can not afford the xxxx withdraws of $240.00 bi-weekly, he was extremely rude and ridiculed me saying he could not help me with anything until my account made it his way

tHey are now out of business filed bankruptcy sold their portfolio to a third party and cant be found. PRO-COLLECT IS ILLEGALLY TRYING TO COLLECT ON ILLEGAL BILLING STATEMENTS THAT ARE TOTALLY FALSE AND WITHOUT MERIT.

Overall, about 30% of complaints against financial firms include intensifiers. Reddit provides an interesting contrast set because it has tens of thousands of forums focused on very different matters. The median percentage of posts-with-intensifiers in Reddit forums is 15%. Only 5% of all Reddit forums have as many intensifiers as complaints about banks–for Reddit, these are highly emotional topics having to do with problems in romantic relationships and debates on religion or gender. In the financial service complaints data, the very highest percentage of intensifiers is in Mortgages–that’s when people are talking about their families losing their homes, so it’s no wonder that it’s so high.

We can get more granular than Mortgages. Across all different kinds of financial products, let’s look at what sort of issues customers use intensifiers with disproportionately:

  • Can’t repay my loan
  • Loan modification, collection, foreclosure
  • Application, originator, mortgage broker
  • Dealing with my lender or servicer
  • Problems when you are unable to pay
  • Problems caused by my funds being low
  • Communication tactics

In other words, people are using intensifiers in highly-fraught situations when their homes or possessions are on the line, as well as when they feel like there is problematic communication. This recalls one of the major findings about one-star ratings in Yelp reviews–they are rarely about food, they are about awful service.

For a sense of contrast, here are categories where consumers use fewer intensifiers than we’d expect if everything were just random:

  • Incorrect information on credit report
  • Improper use of credit report
  • Unable to get credit report/credit score
  • Credit reporting company’s investigation

This also means that while people issue complaints to credit bureaus, they don’t use that many intensifiers–so complaints against TransUnion, Experian, and Equifax have low rates of intensifiers. The highest rates of intensifiers in complaints go with companies like Green Tree Servicing, Enhanced Recovery Company, Ocwen, NationStar Mortgage, and Wells Fargo. That’s in part because while bad credit ratings definitely affect people, it’s not as intense an emotional situation as a home being on the line. Automated processes are also seen differently than direct contact with humans (loan officers, etc.).
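To make "more (or fewer) intensifiers than we'd expect" concrete, here is a rough sketch that compares each category's intensifier rate to the overall rate. It is an illustration under assumptions, not the analysis behind the lists above: the word list INTENSIFIERS, the function intensifier_lift, and the sample complaints are all invented, and a real analysis would also want a significance test.

```python
import re
from collections import defaultdict

# A small illustrative word list; the post does not give the actual list used.
INTENSIFIERS = {"very", "really", "extremely", "totally", "literally", "completely"}
TOKEN = re.compile(r"[a-z']+")

def has_intensifier(text):
    """True if any token in the text is in the intensifier list."""
    return any(tok in INTENSIFIERS for tok in TOKEN.findall(text.lower()))

def intensifier_lift(complaints):
    """complaints: iterable of (category, text) pairs.

    Returns {category: category rate / overall rate}. Values above 1 mean
    the category uses intensifiers more than the data set as a whole.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, text in complaints:
        totals[category] += 1
        if has_intensifier(text):
            hits[category] += 1
    n_docs = sum(totals.values())
    n_hits = sum(hits.values())
    if n_docs == 0 or n_hits == 0:
        # No data, or no intensifiers anywhere: nothing to compare against.
        return {category: 0.0 for category in totals}
    overall = n_hits / n_docs
    return {c: (hits[c] / totals[c]) / overall for c in totals}

# Invented examples, not real complaints.
complaints = [
    ("Mortgage", "They were extremely rude about my loan modification."),
    ("Mortgage", "This is totally false and I may lose my home."),
    ("Credit reporting", "There is incorrect information on my report."),
    ("Credit reporting", "The investigation was really slow."),
]
lift = intensifier_lift(complaints)
print(lift)
```

A plain word list like this misses creative spellings such as soooo and ALL CAPS shouting, so treat the resulting rates as a lower bound on intensity.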

Intensifiers are just a tiny aspect of assessing risk. Ideally, you want a system that considers all kinds of words and phrases–actually, you want to detect these automatically and give them weights based on the statistical strength of their signal. To learn more about the ways that adaptive machine intelligence works to do this, check out this blog post or our use cases page.

Humans can barely understand emojis. Will machines do any better?

22 Sep

The human skull has 14 facial bones and 35 muscles wrapping around these bones. That anatomy works together to form everything from grimaces, to grins, to mouths agape. Beyond the face, there are all kinds of cues that you can use to understand someone: voice contours, body language, and eye contact, to name a few.

All this context disappears when we switch to text. Emojis and emoticons help fill in the gap. They let us express a stance; for instance, “Ok” can connote “I’m a little bothered,” but “Ok :)” means the situation really is okay. As a special bonus, in addition to some 130 available facial expressions, emojis let us style ourselves into sleepy pandas, sparkle tigers, and thousands of otherwise-impossible contortions.

While plasticity is part of what makes emojis fun to use, it’s also what can make them complex to understand. But, as more communication migrates to digital avenues—think about how often you text versus how often you make a phone call—deciphering our 21st-century shorthand is becoming essential.

Continue reading: check out the full article from Qualcomm!

Don’t mention museums! Tips for couchsurfers and sentiment analysers

31 Aug

I had the great pleasure of hosting a webinar with Vita Markman and Chris Potts. Vita joined us from LinkedIn where she is an engineer handling all sorts of natural language processing (NLP) tasks. Chris joined us from Stanford, where he is an associate professor of linguistics and director of the Center for the Study of Language and Information (CSLI).

One of the problems that sentiment analysis runs into is similar to any other classification problem: what’s in and what’s out for each category? Chris had examples like:

Many consider the masterpiece bewildering, boring, slow-moving, or annoying

In this case, something is called a masterpiece, but it’s also reportedly much-maligned. Depending on what you’re doing with sentiment analysis, you may want to deal with reported information differently than someone talking about their direct experience. It’s a lot harder to get people to agree on how to categorize emotions when they’re embedded in something like an I heard that you feared that he sensed that she thought that they said that everyone absolutely loved it.

Classification requires consistency

When Vita and Chris talk about experimental design, this is an important part–defining categories so that humans are consistent is a crucial step for getting machines to be able to automatically classify something. That’s true whether you’re classifying social media in terms of sentiment or extracting person names from Korean product reviews.

Vita gave the example of a former colleague wanting to crowdsource emotionally-charged language–but they couldn’t define what that meant. Machines can learn patterns automatically from large sets of data, but they have to learn from something. Unless you (and your team) can give exemplars and consistently label the categories you care about, it’s hard to get other people or machines to do the classification correctly.
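One common way to check whether human labelers are consistent is an agreement statistic such as Cohen's kappa, which corrects raw agreement for the agreement you'd get by chance. The webinar summary doesn't say which measure Vita or Chris use, so take this as a sketch of one reasonable option; the function name and the labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented sentiment labels from two hypothetical annotators.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on four of six items, but half of that agreement is expected by chance, so kappa comes out much lower than the raw 67%. A low score is usually a hint that the category definitions need work before any machine is trained on the labels.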

The extra wrinkle in analyzing automatic classifications is that correlations sometimes behave in ways we don’t expect. As Chris says about trying to measure team effectiveness through politeness and sentiment, “productive teamwork might be possible only if people feel empowered to express frustration, which will be read as negativity correlating with a desirable team outcome.” This is the case with speed-dating, too, in which saying something negative about each other correlates to a positive speed-dating experience.

Training on your data is better than training on someone else’s

Another aspect we talked about in the webinar had to do with appreciating domain-specificity. It’s often a bad idea to try to treat a model from one set of data as something generic that can be applied to any other kind of data. Consider couchsurfing: Chris analyzed which words went with people who were identified by their hosts as good surfers and which went with those who weren’t. What hosts really wanted were people who engaged with them and weren’t just using the couch as a landing pad. As Vita said after he showed the results in the webinar, “I have never seen museum in a negative context before…[it] reinforces how domain-specific and how context- and people-specific sentiment words can be.”

Bringing in context is also how you know what to do with something like You’re terrible!

If everyone is smiling and laughing, there’s a pretty good chance that’s positive even though on the face of it telling someone they are terrible should be negative. This is also how Chris addresses how to think of sarcasm–there’s a nice layout of this in the webinar, walking through what bits of context you could lean on to get the sentiment right for Yeah, great idea.

We also talk a bit about politeness, power, reputation, and emotion. Near-and-dear to my own heart is the idea of positioning. In the webinar, we discussed work on social balance/social status. Understanding how to impute social relationships from words and other features helps you understand how to interpret something potentially ambiguous like You’re one crazy {expletive}!

Easy-to-implement practicalities

We also talked practicalities, like Vita’s helpful suggestion about how you find key phrases that are meaningful, rather than just popular. Let’s say you’re looking for bigrams and trigrams that matter. If you just use frequency, you’ll end up with lots of prepositional phrases like of your department or non-topical things like good morning. She shows how to drop those so that you can focus on things like jobs on LinkedIn or talent solutions.
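Vita's exact method is in the webinar, not in this post, so the following is only a sketch of one simple way to get a similar effect: count n-grams, but skip any that start or end with a function word, and skip a short list of known non-topical phrases. The names STOP_WORDS, NON_TOPICAL, and key_ngrams, along with the sample sentences, are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical lists for illustration; tune them to your own data.
STOP_WORDS = {"of", "your", "the", "a", "an", "in", "on", "to", "for", "and", "at", "i", "am", "we"}
NON_TOPICAL = {("good", "morning"), ("thank", "you")}

def key_ngrams(texts, n):
    """Count n-grams, dropping those that begin or end with a stop word."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # A stop word in the middle is fine ("jobs on linkedin");
            # one on either edge usually signals a fragment ("of your department").
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            if gram in NON_TOPICAL:
                continue
            counts[gram] += 1
    return counts

# Invented messages.
messages = [
    "Good morning, I am the head of your department.",
    "Good morning! We post jobs on LinkedIn.",
    "Ask about talent solutions and jobs on LinkedIn.",
]
bigrams = key_ngrams(messages, 2)
trigrams = key_ngrams(messages, 3)
print(trigrams.most_common(3))
```

Checking only the edges of the n-gram is the design choice that lets jobs on LinkedIn survive while of your department gets dropped.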

We also chat a bit about cleaning up the data, which is always important. An additional point from Vita here: people often remove “stop words” because they can get in the way of seeing trends. Stop words are little, frequent words like of, may and the. One of the most important things to consider, says Vita, is negation. Negations like not and never are often removed but that can give you a very inaccurate reading about what’s going on.

Vita has mentioned these examples:

  • rarely arrived on time
  • cd arrived without case
  • no issues with delivery. arrived promptly
  • no delivery. issues with shipping.

If you don’t know about rarely or without, you won’t understand what’s going on in the first two examples. And if you don’t understand the “scope” of no in the other two examples, your system won’t understand that the third is reassuring to a company while the fourth may suggest a big problem.
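Here is a small sketch of the difference this makes, using Vita's examples. The stop list is hypothetical (real lists vary by toolkit), and the function content_tokens is invented for illustration. Note that keeping the negation words is only the first step: it does not by itself work out the scope of no.

```python
import re

# Hypothetical stop list for illustration; real lists vary by toolkit.
STOP_WORDS = {"of", "may", "the", "on", "with", "no", "not", "never", "without", "rarely"}
# Words that flip or weaken meaning and are worth keeping.
NEGATIONS = {"no", "not", "never", "without", "rarely"}

def content_tokens(text, keep_negations=True):
    """Lowercase, tokenize, and drop stop words, optionally sparing negations."""
    drop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in drop]

examples = [
    "rarely arrived on time",
    "cd arrived without case",
    "no issues with delivery. arrived promptly",
    "no delivery. issues with shipping.",
]
for text in examples:
    # Left: naive removal. Right: negations preserved.
    print(content_tokens(text, keep_negations=False), "vs", content_tokens(text))
```

With naive removal, rarely arrived on time collapses to arrived, time, which reads like good news.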

Go watch the webinar to get even more ideas, and contact us if you’d like to hear how we help with consistent, context-specific, easy, and actionable insights.

Emoji use: Who, where, how

20 Aug

Emoji are on the rise. People on their smartphones and on social media use emoji to add a visual key to their message. Today, emoji are being used in advertising, in the courtroom, and even in recent political campaigns. To learn more about how emoji are being used in the business world, you can check out the blog post and video here.

There were 722 emoji when the Unicode 6.0 character set was released in 2010, and one hundred more have been, or will be, added. So it’s not surprising that not all emoji are used equally. What are the most frequently used emoji? Are some emoji used and interpreted differently across different cultures and groups of people? And do people really use emoji to communicate strong emotions or are they more of a whimsical addition to a text message?

Check out this video to learn about the who, where, and how of emoji use around the world!

Alex Korbonits turns a computer into James Joyce: Deep Learning from text (and images)

27 Jul

Last week we had the pleasure of welcoming Alex Korbonits to speak at Idibon about Deep Learning. Practically speaking, Alex gave us the low-down on the different tools that people are using to do Deep Learning. Inspirationally speaking, he showed us how computers imagine the Seattle skyline and how they would write if you taught them only James Joyce.

So let’s start with the ooooh and end with the hooooww.

Deep Learning, like other forms of machine learning, is about finding patterns in data. The “depth” of Deep Learning is that it involves a bunch of layers that feed into each other: each layer extracts higher-level features until the final layer, where a decision is made.
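To make "layers that feed into each other" a bit more tangible, here is a toy forward pass in plain Python. It is only a sketch: the weights are invented rather than learned, the network is tiny, and none of the tools Alex reviewed are involved.

```python
def relu(values):
    """Keep positive activations, zero out the rest."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the output of each layer into the next; the last layer gives the scores."""
    activations = inputs
    for i, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if i < len(layers) - 1:
            # Nonlinearity between layers; the final layer is left as raw scores.
            activations = relu(activations)
    return activations

# Invented weights: 2 inputs -> 2 hidden units -> 1 output score.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
scores = forward([2.0, 1.0], layers)
decision = scores[0] > 0  # the "decision" made at the final layer
print(scores, decision)
```

In a real network the weights come from training on data, and there are many more layers and units, but the flow of information is the same.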

One of the ways to figure out what’s happening in all those layers is to ask the computer to exaggerate what it’s finding at a given layer. That’s how you get stuff like the doge below—because Google’s training information contains so many dogs and faces, it “sees” dogs and eyes all over the place. (For more information check out Google’s blog here.)

On the literature front, Alex loves James Joyce, so he wanted to see what Deep Learning would do if he gave it Ulysses and said, “Write me something.” You can read about how he did this (so you can do it yourself…I am) on his blog here. You’ll need to read this as poetry if you’re going to enjoy it.

Bloom works. Quick! Pollyman. An a lot it was seeming, mide, says, up and the rare borns at
Leopolters! Cilleynan’s face. Childs hell my milk by their
doubt in thy last, unhall sit attracted with source
The door of Kildan
and the followed their stowabout over that of three constant
trousantly Vinisis Henry Doysed and let up to a man with hands in surresses afraid quarts to here over
someware as cup to a whie yellow accept thicks answer to me.

As Alex notes, all that he gave the computer was Ulysses. It didn’t know English at all, yet it’s able to make up words that are fairly English-like and it even gets some grammar right-ish–notice the prepositions.
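If you want a feel for character-by-character generation without setting up a neural network, a character-level Markov chain is a much simpler stand-in. To be clear, this is not what Alex did (he used Torch) and it will not learn structure the way a deep network does; it just picks each next character based on the few characters before it. The function names and the tiny corpus are made up for illustration.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each run of `order` characters to the characters seen right after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time; `seed` must be `order` characters long."""
    order = len(seed)
    out = seed
    for _ in range(length):
        choices = model.get(out[-order:])
        if not choices:
            # Context never seen in training (or only at the very end): stop.
            break
        out += rng.choice(choices)
    return out

# Tiny invented corpus; swap in a whole novel to get something more fun.
corpus = "the quick brown fox jumps over the lazy dog and the quick red fox naps"
model = build_model(corpus, order=3)
sample = generate(model, "the", 40, random.Random(0))
print(sample)
```

Every four-character window of the output appears somewhere in the training text, which is why even this crude approach produces English-looking fragments.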

Meanwhile, Andrej Karpathy feeds in Tolstoy and gets:

Pierre aking his soul came to the packs and drove up his father-in-law women.

(Now to synthesize! Read Gretchen McCulloch’s Grammar of Doge.)

Time to get practical. Here are the Deep Learning tools that Alex reviewed for us:

  • Torch: This is what Alex actually used for his James Joyce project; one of its benefits is a large number of packages so you don’t have to start from zero. Additionally, it is increasingly the tool of choice used for doing deep learning research.
  • Caffe: Like Torch, it has a large amount of existing work that you can build on, and it is one of the easier tools to use. One of its main strengths is its Model Zoo, where many reference models are already built and pretrained, so if you’re chomping at the bit, you don’t have to wait weeks to train a larger model such as AlexNet.
  • Theano: This is probably the most sophisticated tool out there but people tend to find it pretty complicated. However, lots of popular Pythonic projects are being created on top of it and used in places such as Kaggle competitions (check out Keras, PyLearn, and Lasagne).
  • GraphLab Create: From a company called Dato, this is also one of the easier tools to use if you’re just getting started. Given a dataset, their toolkit will pick a “sane default” network topology so that you don’t have to build one from scratch.

Since layers are a big part of Deep Learning, we’ll conclude with a picture of Alex presenting to us followed by what two different layers are seeing in the image—the first layer is seeing contours of shapes, while the second is hallucinating all kinds of different more abstract shapes.




Emoji: Why brands should pay attention

21 Jul

Ever since their inclusion in the Apple iOS system in 2011, there has been a surge in emoji use around the world. Today, emoji are not only used in text messages, social media, and email correspondence, but can be found in literature, ad campaigns, and even the courtroom.

How are brands using emoji today? What are the dangers of omitting emoji when doing text analytics? And, most importantly, why should brands care about these little images in the first place?

Watch this short video to learn a bit about emoji, text analytics, and why brands have been and should be paying attention.

Want to learn more? Check out these blog posts on the grammar of emoji and on the eMomji phenomenon!