People like John Wayne

27 Jun

In natural language processing (NLP), it is common for systems to ignore fragments longer than three words. This makes sense from a machine-learning point of view: the number of possible combinations of four or more words in any language is astronomically high, and storing or counting them quickly hits memory and processing constraints.
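To get a feel for how fast that number grows, here is a rough back-of-the-envelope sketch. The vocabulary size of 50,000 is an assumption picked purely for illustration, and the result is only an upper bound, since most word sequences never occur in real text.

```python
# Hypothetical vocabulary size, chosen only for illustration.
VOCAB_SIZE = 50_000


def possible_ngrams(vocab_size: int, n: int) -> int:
    """Upper bound on the number of distinct n-word sequences."""
    return vocab_size ** n


# Print the bound for fragments of one to four words.
for n in range(1, 5):
    print(f"{n}-word fragments: {possible_ngrams(VOCAB_SIZE, n):.2e}")
```

Each extra word multiplies the bound by the vocabulary size, which is why cutting off at three words is such a common compromise.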

But a fragment, like the title of this article, can be ambiguous:

  1. People like John Wayne movies
  2. People like John Wayne in movies
  3. People like John Wayne act in movies

The main topic of the first sentence is ‘John Wayne movies’, the main topic of the second is just ‘John Wayne’, and the main topic of the third is ‘movies’: each sentence is subtly focussed on a different part. The first two examples above express positive sentiment; the third does not. Without a deeper understanding of the sentences, just a few extra words can make a big difference.

The third sentence uses ‘like’ to encode information in a way that the first two do not: John Wayne is a person. It is obvious to us, but maybe not to a computational knowledge base. By interpreting these kinds of sentences correctly, we can therefore extract information about the world, in addition to understanding the sentences themselves. This article is about this third type of example: People like John Wayne act in movies.

In computational linguistics, we call this kind of relationship hyponymy: in “states like California”, “companies like IBM”, “trees like oak”, etc., the second term is a hyponym, a specific type or instance of the general form (see Marti Hearst’s seminal 1992 work).
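As a minimal sketch of how such pairs might be pulled out of raw text, here is a surface pattern for “X like Y” where Y is a run of capitalized words. The regular expression and the function name `extract_like_pairs` are my own illustration, not Hearst’s original patterns, and there is no part-of-speech tagging or named entity recognition behind it.

```python
import re

# One word (X), then "like", then one or more capitalized words (Y).
# Purely a surface pattern: capitalization stands in for "proper noun".
LIKE_PATTERN = re.compile(
    r"\b([A-Za-z]+)\s+like\s+([A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*)"
)


def extract_like_pairs(text):
    """Return (X, Y) pairs for every 'X like Y' match in the text."""
    return [(m.group(1), m.group(2)) for m in LIKE_PATTERN.finditer(text)]


sample = "We flew with carriers like Lufthansa. People like John Wayne act in movies."
print(extract_like_pairs(sample))
```

Note that the same pattern also fires on “People like John Wayne movies”, where “like” is a verb. That is exactly the ambiguity in the title: a surface match alone cannot tell the readings apart.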

“I went to the general store but they wouldn’t let me buy anything specific.” – Steven Wright

One of the ways to give someone a toehold in a conversation or a report is to give an example. People often have a hard time with abstractions, so examples help increase understanding. They offer other rhetorical benefits, too: you may also want to use examples to introduce a particular case that you want to make into the main topic. Maybe you want to talk about GE, the US Supreme Court, Havana, the Outback, Rafael Nadal, or Nelson Mandela.

I’ve been thinking about this since I saw the phrase “carriers like Lufthansa”. There’s a specific. Let’s go general. I give you the construction “X like Y” and tell you that “Y” is some kind of named entity. What kind of Y is most common: Location, Organization, or Person? Whether you’re doing sentiment analysis, opinion mining, question-and-answer matching, or any of a number of other natural language processing tasks, you want your system to be able to identify and distinguish these kinds of named entities at a minimum: e.g., “what am I doing sentiment analysis about?”. Part of building a good system is understanding the distributions in contexts like the one we’re talking about here.

For a first pass, let’s go to the Corpus of Contemporary American English, which is a great site for exploring these kinds of questions very quickly. I look for all “Noun like ProperNoun” constructions and grab all of the matches that have at least 3 occurrences. The number one most popular example is “states like California”. (Btw, often California is the only example state listed. Well may we wonder: how many other states *are* like California? When examples are provided, the most California-like states are Florida and Texas. Your objections are noted.)

This example illustrates the general theme: journalists love “states/cities/countries/places like Y”. 62% of all of the “X like Y” examples have Y as a location. (It’s 48% if you use “types” instead of “tokens”. Tokens let you count each occurrence of “states like California”; types say “nope, that’s just one instance, even if it occurs 87 times”.) The next thing to note, of course, is how often organizations/locations/people are mentioned throughout the corpus. I haven’t relativized the COCA numbers below, but in general we see organizations get the most mentions. (Alas, organizations are also the hardest of these three labels to get right.)
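The token/type distinction is easy to see in code. The sketch below uses a handful of made-up matches, not COCA data, with hypothetical labels `LOC`, `ORG`, and `PER` for the entity type of Y; the helper `share_by_label` is my own name.

```python
from collections import Counter

# Toy, made-up matches: (phrase, entity type of Y). Not COCA data.
matches = [
    ("states like California", "LOC"),
    ("states like California", "LOC"),
    ("states like California", "LOC"),
    ("companies like IBM", "ORG"),
    ("people like John Wayne", "PER"),
]


def share_by_label(pairs):
    """Fraction of the given (phrase, label) pairs that carry each label."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


token_share = share_by_label(matches)       # every occurrence counts
type_share = share_by_label(set(matches))   # each distinct phrase counts once

# Keep only phrases seen at least 3 times, like the cutoff used above.
phrase_counts = Counter(phrase for phrase, _ in matches)
frequent = {p: n for p, n in phrase_counts.items() if n >= 3}

print(token_share)
print(type_share)
print(frequent)
```

In this toy data, locations make up a larger share of tokens than of types, because one location phrase is repeated; the same effect is what moves the real figure from 62% down to 48%.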

Table of "X like Y" constructions in the Corpus of Contemporary American English

– Tyler Schnoebelen (@TSchnoebelen)
