Archive | October, 2013

An NLP tutorial with Roger Ebert

30 Oct


Natural Language Processing is the process of extracting information from text and speech. In this post, we walk through different approaches for automatically extracting information from text—keyword-based, statistical, machine learning—to explain why many organizations are now moving towards the more sophisticated machine-learning approaches to managing text data.

Roger Ebert saw North in 1994. Paragraph number eight of ten in his review:

I hated this movie. Hated hated hated hated hated this movie. Hated it. Hated every simpering stupid vacant audience-insulting moment of it. Hated the sensibility that thought anyone would like it. Hated the implied insult to the audience by its belief that anyone would be entertained by it.

North: A comedy by Rob Reiner that Roger Ebert hated

Method 1: Check a dictionary of keywords

Imagine that you wanted to process Roger Ebert’s review to automatically add a star-rating to it. It would be pretty easy with this example. Ebert announces his hate ten times directly and gilds the lily with other lovely negative words like simpering, stupid, insulting, and insult. A really basic sentiment analysis tool will still get this review right thanks to these keywords.

If we had a simple dictionary of “positive” and “negative” words we would get the correct result. Even though we don’t understand that positive words like “entertained” are used for contrastive effect, the negative words in our dictionary outnumber the positive words, so the end result would be correct.

But if we take this approach to other reviews, we can see that this doesn’t work:

Young men: If you attend this crap with friends who admire it, tactfully inform them they are idiots. Young women: If your date likes this movie, tell him you’ve been thinking it over, and you think you should consider spending some time apart. (More Ebert on Battle: Los Angeles)

This is clearly a bad review, but the positive words outnumber the negative ones.
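To make Method 1 concrete, here is a minimal sketch of a keyword counter in Python. The two word lists are made up for this illustration (they are not a real sentiment lexicon), and the function name `keyword_sentiment` is our own. It gets the North review right and the Battle: Los Angeles review wrong, for exactly the reason described above.

```python
import re

# Hypothetical mini-lexicons, chosen only for this illustration.
POSITIVE = {"like", "likes", "entertained", "friends", "admire", "tactfully"}
NEGATIVE = {"hated", "simpering", "stupid", "vacant", "insulting", "insult",
            "crap", "idiots"}

NORTH = (
    "I hated this movie. Hated hated hated hated hated this movie. Hated it. "
    "Hated every simpering stupid vacant audience-insulting moment of it. "
    "Hated the sensibility that thought anyone would like it. Hated the "
    "implied insult to the audience by its belief that anyone would be "
    "entertained by it."
)
BATTLE_LA = (
    "Young men: If you attend this crap with friends who admire it, tactfully "
    "inform them they are idiots. Young women: If your date likes this movie, "
    "tell him you've been thinking it over, and you think you should consider "
    "spending some time apart."
)

def keyword_sentiment(text):
    """Label a text by comparing counts of positive and negative keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return label, pos, neg

print(keyword_sentiment(NORTH))      # ('negative', 2, 15)
print(keyword_sentiment(BATTLE_LA))  # ('positive', 4, 2)
```

With these word lists, the second review is labeled positive even though any human reader can see it is a pan.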

Method 2: Applying weights to words

Rather than just have a flat list of words in a dictionary marked as positive or negative, we could take a more statistical approach and give each word a weight for how positive or negative it is.

Many of the positive words, like “tactfully”, are not very strong when compared to “crap” or “idiots”. If each word has a rating for how positive or negative it is (say on a 0 to 100 scale), then we can combine the ratings for each word and get a more accurate prediction of the correct rating for this review.
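As a rough sketch of the weighted approach, assume each word carries a signed weight: the sign is the polarity and the size is the strength on the 0 to 100 scale. The numbers below are invented for the example, not measured from any data.

```python
import re

# Hypothetical weights: sign = polarity, magnitude = strength (0 to 100).
WEIGHTS = {
    "friends": 15, "admire": 30, "tactfully": 20, "likes": 25,
    "crap": -80, "idiots": -75,
}

BATTLE_LA = (
    "Young men: If you attend this crap with friends who admire it, tactfully "
    "inform them they are idiots. Young women: If your date likes this movie, "
    "tell him you've been thinking it over, and you think you should consider "
    "spending some time apart."
)

def weighted_sentiment(text):
    """Sum the signed weights of every known word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WEIGHTS.get(w, 0) for w in words)

print(weighted_sentiment(BATTLE_LA))  # -65
```

Four mild positive words add up to 90, but two strong negative words add up to -155, so the total comes out negative.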

But even this approach can fail. Consider this example:

In my own very humble opinion, In Praise of Love lacks even the most fragmented charms I have found in almost all of his previous works. (Andrew Sarris)

In this case, you can immediately see that “praise” and “love” are just part of a title, and so they should be ignored, as should any other names. (For more on Named Entity Recognition, check out our posts about song titles, Kentucky Derby horses, places that aren’t anywhere, or Burning Man camps and the S&P 1500.) You might also want to treat the negating word “lacks” as a clue that the positive words that follow it should actually be treated as negative. Further still, you might want to include more sophisticated rules to account for how people write reviews. For example: if there’s a combination of negative and positive words, assume that the overall review is negative (we tend not to use negative words for comparative effect in purely positive reviews).
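Two of these rules can be sketched in a few lines. This is only an illustration under strong assumptions: the weights and the negator list are invented, and a hard-coded list of known titles stands in for real Named Entity Recognition.

```python
import re

WEIGHTS = {"praise": 60, "love": 70, "charms": 50}  # hypothetical weights
NEGATORS = {"lacks", "not", "never"}                # hypothetical negator list
KNOWN_TITLES = ["In Praise of Love"]                # stand-in for real NER

SARRIS = (
    "In my own very humble opinion, In Praise of Love lacks even the most "
    "fragmented charms I have found in almost all of his previous works."
)

def score(text, handle_context=True):
    """Sum word weights, optionally masking titles and flipping after negators."""
    if handle_context:
        for title in KNOWN_TITLES:
            text = text.replace(title, " ")
    total = 0
    # Work clause by clause so a negator only affects its own clause.
    for clause in re.split(r"[.,;:!?]", text.lower()):
        negated = False
        for word in re.findall(r"[a-z']+", clause):
            if handle_context and word in NEGATORS:
                negated = True
                continue
            weight = WEIGHTS.get(word, 0)
            total += -weight if negated else weight
    return total

print(score(SARRIS, handle_context=False))  # 180
print(score(SARRIS))                        # -50
```

Without the two rules, the title words and “charms” all count as positive. With them, the title is ignored and “charms” is flipped by “lacks”.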

At this point, things are getting fairly complicated. How do we know what weight to apply to each word? How should this weight change in negative or contrastive contexts? What rules should be applied, and should they be definitive rules or should they contribute weights to an overall prediction alongside all the words? Should we move beyond words to phrases and, if so, how can we assign weights to every possible phrase? There are more than 23 billion three-word sequences in the Google n-gram corpus for English alone, which puts even this relatively narrow context beyond the limits of purely human approaches.
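For readers who have not met n-grams before, here is a small sketch of what a “three-word sequence” is and why the numbers get big so quickly. The helper name `ngrams` and the 100,000-word vocabulary are our own assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "hated every simpering stupid vacant moment".split()
trigrams = ngrams(tokens, 3)
print(trigrams[0])    # ('hated', 'every', 'simpering')
print(len(trigrams))  # 4

# Upper bound on distinct trigrams for a hypothetical 100,000-word vocabulary.
vocabulary_size = 100_000
print(vocabulary_size ** 3)  # 1000000000000000
```

Most of those possible combinations never occur in real text, but the ones that do are still far too many to weight by hand.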

Method 3: Machine learning

When we are talking about processing billions of combinations, we need the machines to take over.

The simplest machine learning approach for a task like sentiment analysis is called ‘Naive Bayes’. It’s not that different from Method 2. Let’s assume that you have a set of movie reviews that already have ratings. You can use those existing reviews to determine the weight that should be applied to each word. In machine learning terminology, this weight is usually expressed as a probability. For example, there might be an 85% probability that when the word “brilliant” is used, it is in a 5-star review. This is one way that we can learn all the weights (probabilities) for the words or phrases.
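Here is a minimal sketch of how such probabilities could be counted, assuming a tiny made-up set of rated reviews (`RATED_REVIEWS` and `rating_given_word` are illustrative names of our own). It estimates the probability of each rating given that a word appears. Note that Naive Bayes implementations usually estimate the reverse, the probability of a word given the rating, and then apply Bayes’ rule; this sketch follows the simpler framing in the paragraph above.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (review text, star rating).
RATED_REVIEWS = [
    ("a brilliant and moving film", 5),
    ("brilliant performances throughout", 5),
    ("brilliant idea but a dull film", 2),
    ("dull and stupid", 1),
]

def rating_given_word(reviews):
    """Estimate P(rating | word) from the share of reviews containing the word."""
    counts = defaultdict(Counter)
    for text, rating in reviews:
        for word in set(text.split()):  # count each word once per review
            counts[word][rating] += 1
    return {
        word: {r: c / sum(by_rating.values()) for r, c in by_rating.items()}
        for word, by_rating in counts.items()
    }

probs = rating_given_word(RATED_REVIEWS)
print(probs["brilliant"])  # two of the three reviews using it are 5-star
print(probs["stupid"])     # {1: 1.0}
```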

(Side note: with Naive Bayes, and machine learning more broadly, we usually multiply the probabilities instead of adding them, and there are a few other statistical methods that are typically used to account for unbalanced data and to do sensible things with unknown or low-count items.)
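The side note can be sketched as a tiny Naive Bayes classifier. The training data is invented, and the specific choices here are common conventions rather than the only options: multiplying probabilities is done by adding their logarithms (to avoid very small numbers), low-count words get add-one smoothing, and words never seen in training are simply skipped.

```python
import math
from collections import Counter

# Hypothetical toy training set: (review text, label).
TRAIN = [
    ("brilliant moving film", "pos"),
    ("brilliant funny charming", "pos"),
    ("dull stupid film", "neg"),
    ("hated this stupid movie", "neg"),
]

def train(examples):
    """Count words per label, reviews per label, and the overall vocabulary."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for text, label in examples:
        label_counts[label] += 1
        counter = word_counts.setdefault(label, Counter())
        for word in text.split():
            counter[word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # adding logs = multiplying probabilities
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("brilliant film", model))    # pos
print(predict("stupid dull plot", model))  # neg
```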

There are a number of methods for doing machine learning, each with its own strengths and weaknesses. Logistic regression (and the related maximum entropy approach) is probably the most widely used. If you see people talking about ‘models’ or ‘linear models’ in machine-learning circles, this is probably what they are referring to. These methods will arrive at more precise weights than Naive Bayes by taking into account that certain words or phrases, when used together, do not have the same probability as when they are used independently. Support Vector Machines (SVMs) can be similar, and the more complex SVMs can be sensitive to observations like “the difference between 1 and 2 negative words is more important than the difference between 7 and 8 negative words”. Decision trees turn the problem into rules, for example “if there is a positive word following a negation, it is negative”. Neural nets allow the most complex kinds of probabilistic rules between features, but are the most time-consuming and complicated to use.
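To give a feel for the logistic regression case, here is a bare-bones sketch trained with plain gradient descent on four invented reviews. In practice you would reach for an existing library rather than write this by hand; the function names, learning rate, and number of passes are all assumptions made for the example. The point is that the weights are fitted together, so a word like “brilliant” that shows up in both a positive and a negative review shares the explanation with the words around it.

```python
import math

# Hypothetical toy data: (review text, 1 = positive, 0 = negative).
TRAIN = [
    ("brilliant moving film", 1),
    ("charming and funny", 1),
    ("not brilliant", 0),
    ("dull stupid film", 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probability(text, weights, bias):
    """P(positive | text) under the current weights."""
    z = bias + sum(weights.get(word, 0.0) for word in text.split())
    return sigmoid(z)

def fit(examples, epochs=200, learning_rate=0.1):
    """Fit one weight per word (plus a bias) by full-batch gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for text, label in examples:
            error = label - probability(text, weights, bias)
            grad_b += error
            for word in text.split():
                grad_w[word] = grad_w.get(word, 0.0) + error
        # Apply the accumulated gradient once per pass over the data.
        bias += learning_rate * grad_b
        for word, grad in grad_w.items():
            weights[word] = weights.get(word, 0.0) + learning_rate * grad
    return weights, bias

weights, bias = fit(TRAIN)
for text, label in TRAIN:
    print(text, label, round(probability(text, weights, bias), 2))
```

After training, “not” ends up with a negative weight because it only ever appears in a negative review, which lets the model separate “not brilliant” from “brilliant moving film” even though they share a word.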

What binds all of these together is the idea that you can have computers learn the appropriate combinations of words/phrases and more complex rules without being explicitly told “apply this rule, then this rule, then this rule” and without having to pre-decide what every single word or phrase should mean in a given context.

In an earlier post, we talked about why automated sentiment analysis is so hard and a lot of the answer came down to understanding context. That’s really what you’re seeing in this post, too. Using machine learning (rather than keywords) gets us closer to the contexts that are relevant for a given task like predicting the sentiment of a review. The quality of results is almost always helped by having large amounts of data. But for very high levels of accuracy, you also need folks working on the data who understand how language is being used in order to choose the best machine learning methods for the task and the data at hand.

– Tyler Schnoebelen (@TSchnoebelen)