August | 2015 | Corpus linguistics

Archive | August, 2015

Don’t mention museums! Tips for couchsurfers and sentiment analysers

I had the great pleasure of hosting a webinar with Vita Markman and Chris Potts. Vita joined us from LinkedIn where she is an engineer handling all sorts of natural language processing (NLP) tasks. Chris joined us from Stanford, where he is an associate professor of linguistics and director of the Center for the Study of Language and Information (CSLI).

One of the problems that sentiment analysis runs into is shared with any other classification problem: what’s in and what’s out for each category? Chris had examples like:

Many consider the masterpiece bewildering, boring, slow-moving, or annoying

In this case, something is called a masterpiece, but it’s also reportedly much-maligned. Depending on what you’re doing with sentiment analysis, you may want to deal with reported information differently than someone talking about their direct experience. It’s a lot harder to get people to agree on how to categorize emotions when they’re embedded in something like “I heard that you feared that he sensed that she thought that they said that everyone absolutely loved it.”

Classification requires consistency

When Vita and Chris talk about experimental design, this is an important part–defining categories so that humans are consistent is a crucial step for getting machines to be able to automatically classify something. That’s true whether you’re classifying social media in terms of sentiment or extracting person names from Korean product reviews.

Vita gave the example of a former colleague wanting to crowdsource emotionally-charged language–but they couldn’t define what that meant. Machines can learn patterns automatically from large sets of data, but they have to learn from something. Unless you (and your team) can give exemplars and consistently label the categories you care about, it’s hard to get other people or machines to do the classification correctly.
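One common way to check whether humans are labeling consistently is to have two people label the same items and compute a chance-corrected agreement score. Here is a minimal sketch using Cohen's kappa. The webinar doesn't name a specific measure, so the choice of kappa, the function name and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator assigned labels at random,
    # following their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Invented labels from two hypothetical annotators on the same six posts.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A kappa near 1 means the category definitions are working for people; a kappa near 0 means the annotators agree about as often as chance would predict, which is a sign the categories need sharper definitions before a machine can learn them.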

The extra wrinkle in analyzing automatic classifications is that correlations sometimes behave in ways we don’t expect. As Chris says about trying to measure team effectiveness through politeness and sentiment, “productive teamwork might be possible only if people feel empowered to express frustration, which will be read as negativity correlating with a desirable team outcome.” The same holds for speed-dating, where saying something negative about each other correlates with a positive speed-dating experience.

Training on your data is better than training on someone else’s

Another aspect we talked about in the webinar had to do with appreciating domain-specificity. It’s often a bad idea to treat a model trained on one set of data as something generic that can be applied to any other kind of data. Consider Couchsurfing.com. Chris analyzed which words went with people whom hosts identified as good surfers and which went with people they didn’t. What hosts really wanted were people who engaged with them and weren’t just using the couch as a landing pad. As Vita said after he showed the results in the webinar, “I have never seen museum in a negative context before…[it] reinforces how domain-specific and how context- and people-specific sentiment words can be.”
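To make the idea of words "going with" one group or the other concrete, here is a rough sketch that compares smoothed relative word frequencies between two labeled sets of texts. This is not Chris's method or data: the function name, the add-one smoothing and the four toy references below are invented purely for illustration.

```python
import math
from collections import Counter


def log_ratio_by_word(good_texts, bad_texts, smoothing=1.0):
    """Log of (relative frequency in good texts / relative frequency in bad texts)."""
    good = Counter(w for t in good_texts for w in t.lower().split())
    bad = Counter(w for t in bad_texts for w in t.lower().split())
    vocab = set(good) | set(bad)
    # Additive smoothing keeps words unseen in one group from dividing by zero.
    good_total = sum(good.values()) + smoothing * len(vocab)
    bad_total = sum(bad.values()) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_good = (good[word] + smoothing) / good_total
        p_bad = (bad[word] + smoothing) / bad_total
        scores[word] = math.log(p_good / p_bad)
    return scores


# Invented toy references, not real Couchsurfing data.
good_refs = ["we cooked dinner and talked all night", "great talk over dinner"]
bad_refs = ["spent all day at the museum", "only came back to sleep after the museum"]

scores = log_ratio_by_word(good_refs, bad_refs)
# Positive scores lean toward the "good" group, negative toward the "bad" group.
```

Scores like these only describe the data they were computed on, which is the point of the section: a word that leans negative in one domain can lean positive in another.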

Bringing in context is also how you know what to do with something like You’re terrible!

If everyone is smiling and laughing, there’s a pretty good chance that’s positive even though on the face of it telling someone they are terrible should be negative. This is also how Chris addresses how to think of sarcasm–there’s a nice layout of this in the webinar, walking through what bits of context you could lean on to get the sentiment right for Yeah, great idea.

We also talk a bit about politeness, power, reputation and emotion. Near and dear to my own heart is the idea of positioning. In the webinar, we discuss work on social balance and social status. Understanding how to impute social relationships from words and other features helps you understand how to interpret something potentially ambiguous like You’re one crazy {expletive}!

Easy-to-implement practicalities

We also talked practicalities, like Vita’s helpful suggestion about how you find key phrases that are meaningful, rather than just popular. Let’s say you’re looking for bigrams and trigrams that matter. If you just use frequency, you’ll end up with lots of prepositional phrases like of your department or non-topical things like good morning. She shows how to drop those so that you can focus on things like jobs on LinkedIn or talent solutions.
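As a rough illustration of the idea (not a reconstruction of what Vita showed), one simple heuristic is to count bigrams and trigrams, then drop any that start or end with a stop word and any that appear on a short list of non-topical phrases. The word lists, the function names and the toy texts here are all assumptions.

```python
from collections import Counter

# Illustrative lists only; a real project would tune both for its own domain.
STOP_WORDS = {"of", "your", "the", "on", "in", "a", "to", "for", "and", "with"}
NON_TOPICAL = {("good", "morning"), ("thank", "you")}


def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def key_phrases(texts, n_values=(2, 3), min_count=2):
    """Frequent n-grams, minus stop-word-edged and non-topical phrases."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for n in n_values:
            counts.update(ngrams(tokens, n))
    kept = {}
    for gram, count in counts.items():
        if count < min_count:
            continue
        # Drops phrases like "of your department"; a stop word in the
        # middle, as in "jobs on linkedin", is allowed.
        if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
            continue
        if gram in NON_TOPICAL:
            continue
        kept[gram] = count
    # Most frequent first, ties broken alphabetically.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Invented toy texts.
texts = [
    "good morning head of your department",
    "good morning find jobs on linkedin",
    "talent solutions for head of your department",
    "jobs on linkedin and talent solutions",
]
phrases = key_phrases(texts)
```

On this toy input only jobs on linkedin and talent solutions survive, even though good morning and of your department are just as frequent.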

We also chat a bit about cleaning up the data, which is always important. An additional point from Vita here: people often remove “stop words” because they can get in the way of seeing trends. Stop words are little, frequent words like of, may and the. One of the most important things to consider, says Vita, is negation. Negations like not and never are often removed but that can give you a very inaccurate reading about what’s going on.

Vita has mentioned these examples:

(1) rarely arrived on time

(2) cd arrived without case

(3) no issues with delivery. arrived promptly

(4) no delivery. issues with shipping.

If you don’t know about rarely or without, you won’t understand what’s going on in the first two examples. And if you don’t understand the “scope” of no in the last two examples, your system won’t understand that (3) is reassuring to a company while (4) may suggest a big problem.
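One widely used heuristic for handling this is to keep negation words when stripping stop words and to tag every token after a negator with a _NEG suffix until the next punctuation mark. The sketch below is one way to do that; it is not necessarily what Vita or Chris use, and the word lists and function name are assumptions.

```python
import re

# Illustrative lists; both are assumptions, not taken from the webinar.
STOP_WORDS = {"of", "may", "the", "on", "with", "a", "to"}
NEGATORS = {"no", "not", "never", "rarely", "without"}
PUNCTUATION = set(".,;:!?")


def mark_negation(text):
    """Drop stop words, keep negators, and tag tokens inside a negation scope."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    marked = []
    in_scope = False
    for tok in tokens:
        if tok in PUNCTUATION:
            in_scope = False  # punctuation closes the negation scope
            continue
        if tok in NEGATORS:
            in_scope = True
            marked.append(tok)
            continue
        if tok in STOP_WORDS:
            continue
        marked.append(tok + "_NEG" if in_scope else tok)
    return marked


example_3 = mark_negation("no issues with delivery. arrived promptly")
example_4 = mark_negation("no delivery. issues with shipping.")
```

With this marking, example (3) yields issues_NEG and example (4) yields plain issues, so a downstream classifier can tell the two apart. Treating punctuation as the end of the scope is a crude approximation, and it will get some sentences wrong.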

Go watch the webinar to get even more ideas and contact us at info@idibon.com if you’d like to hear how we help with consistent, context-specific, easy and actionable insights.

Emoji use: Who, where, how

20 Aug

Emoji are on the rise. People on their smartphones and on social media use emoji to add a visual key to their message. Today, emoji are being used in advertising, in the courtroom, and even in recent political campaigns. To learn more about how emoji are being used in the business world, you can check out the blog post and video here.

There were 722 emoji when the Unicode 6.0 character set was released in 2010, and one hundred more have been or will be added. So it’s not surprising that not all emoji are used equally. What are the most frequently used emoji? Are some emoji used and interpreted differently across different cultures and groups of people? And do people really use emoji to communicate strong emotions, or are they more of a whimsical addition to a text message?

Check out this video to learn about the who, where, and how of emoji use around the world!

Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]

African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]

COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]

What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
