Turn these words around in your mouth: fiesta, polo, eon, jazz, rio, rush. They are all great words. They are also the names of some of the top-selling car models from Ford, Volkswagen, Hyundai, Honda, Kia, and Toyota. On the one hand: cool. On the other hand, ARGH.
If you think of your organization’s name, its products, its services, and its features, odds are at least some of them are common words (unless you’re in pharmaceuticals). Honda brand managers may really like Sonny Rollins’ “Saxophone Colossus”, but they don’t usually want to be tracking its popularity when they are looking at social media or other communications. And even in Japanese, jazz = jazz = ジャズ = ジャズ.
It’s the nature of words to be ambiguous. Even when a mention of Sprint really is about the telecommunications company, it’s sometimes about the services and sometimes about the corporate entity. If you’re doing something like sentiment analysis, you might lump them together, or you might want to keep them separate, just as you might want to keep paid/owned mentions (by customer support and corporate brand managers) separate from organic ones (from everyday folks who aren’t paid to talk about Sprint).
The key is to be able to carve up the world into categories that are meaningful. It’s only when categories are meaningful that you can get insights and take actions. Disambiguation is a crucial requirement for understanding.
Ambiguity, relevance and 3 big brands
In this blog post, we take three common words and show how important relevance is: Sprint, Tesla, and Procter & Gamble’s detergent, Tide. I’ve focused on English and already removed spam. (Here’s some recent stuff on French and Spanish, Korean, Russian, and some non-English punctuation marks you need to start using.)
For Tesla, the proportion of tweets having to do with the car company and its vehicles has held pretty steady. The rate of mentions of Nikola Tesla and tesla coils is relatively stable, too.
The relevance rates for Tide and Sprint go up and down with the football season (the Crimson Tide at Alabama) and work/exercise cycles (software sprints but more often, running/races).
Another reason why you need an adaptable system: you have to figure out whether news about, say, the NASCAR Sprint Cup counts as having to do with Sprint or not. The company is investing in the competition but when someone is excited about Joey Logano winning, how relevant is it to brand managers? It’s a matter of taste and what business questions you want to answer. It’s irrelevant if you just want to know how people feel about network coverage but it’s relevant if you’re tracking marketing campaign reach. Defining what counts and having a tool that can handle your custom definitions is important.
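To make "custom definitions" concrete, here is a minimal sketch. It assumes each mention has already been given a topic label by some upstream step, and the topic names, question names, and the `is_relevant` helper are all hypothetical, made up for illustration. The point is only that relevance is looked up per business question rather than hard-coded once.

```python
# Hypothetical mapping: which topic labels count as relevant for each
# business question. The names here are illustrative, not from a real tool.
RELEVANT_TOPICS = {
    "network_coverage": {"phone_service"},
    "campaign_reach": {"phone_service", "nascar_sponsorship"},
}


def is_relevant(topic: str, question: str) -> bool:
    """Return True if a mention with this topic label counts for the question."""
    return topic in RELEVANT_TOPICS.get(question, set())


# A tweet celebrating a Sprint Cup win, assumed to be labeled upstream.
logano_tweet_topic = "nascar_sponsorship"

coverage_view = is_relevant(logano_tweet_topic, "network_coverage")
reach_view = is_relevant(logano_tweet_topic, "campaign_reach")
```

The same tweet is dropped when the question is network coverage and kept when the question is campaign reach, which is the Joey Logano situation described above.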
It’s also worth noting that something like sentiment isn’t randomly distributed. The bulk of irrelevant Tesla mentions are about inventor/scientist Nikola Tesla. Tweets about Tesla cars are fairly positive, but if you don’t disambiguate and get rid of Nikola Tesla references, you’ll end up with a sense that people are far more positive about the cars than they are. People who talk about Nikola Tesla almost always love him. Mustache.
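Here is a toy illustration of that inflation effect. The sentiment scores and the `"car"` / `"inventor"` labels are invented for the example (they are not measurements from the data discussed here); the sketch just shows how averaging over undisambiguated mentions pulls the number up.

```python
# Invented example data: (referent, sentiment score between -1 and 1).
mentions = [
    ("car", 0.2),
    ("car", 0.4),
    ("car", 0.3),
    ("inventor", 0.9),
    ("inventor", 0.8),
]


def mean_sentiment(rows):
    """Average the sentiment scores of (referent, score) pairs."""
    scores = [score for _, score in rows]
    return sum(scores) / len(scores)


# Average over everything that matched the keyword "Tesla".
blended = mean_sentiment(mentions)

# Average over only the mentions that are actually about the cars.
cars_only = mean_sentiment([m for m in mentions if m[0] == "car"])
```

With these made-up numbers the blended average comes out noticeably higher than the cars-only average, purely because the inventor mentions are so positive.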
Capital-S people and Lowercase-S people
In the case of Sprint and Tide, you could say that people who capitalize are more likely talking about the right thing. For Sprint, relevance goes up to 73.80% when you use only the capitalized form, while only 43.56% of people mentioning sprint are talking about the company and its phone services. But if you restricted yourself to the capitalized form, you’d be missing a lot of data: ignoring lower-case sprint gets you only 63.97% of all the conversations you want.
The real problem is that capitalization conventions are not randomly distributed across people. That is to say, the types of people you get talking about Sprint are different from the types talking about the same company/services but referring to them as sprint.
The people who use the capital-S tend to be active in technology: they keep blogs, they talk about science/gadgets, and a lot of them identify as husbands/fathers. The lower-S users who talk about sprint tend to be younger, swear more, and talk more about travel and sports. If you’re a brand manager who only cares about developers and geeks, then you’re fine just looking for people who use proper Sprint capitalization. But you’re going to be missing a lot of data and a lot of the variety of perspectives…which is the whole point of tapping into social media.
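In data-science terms, the 73.80% figure is the precision of a capital-S filter and the 63.97% figure is its recall. The sketch below shows how those two numbers are computed; the six labeled mentions are invented for illustration, and `precision_recall` and `capital_only` are hypothetical helper names, not part of any particular tool.

```python
def precision_recall(mentions, keep):
    """Compute precision and recall of a keyword filter.

    mentions: list of (text, is_relevant) pairs, labeled by hand.
    keep: function that returns True for texts the filter lets through.
    """
    kept = [(text, rel) for text, rel in mentions if keep(text)]
    relevant_total = sum(1 for _, rel in mentions if rel)
    relevant_kept = sum(1 for _, rel in kept if rel)
    precision = relevant_kept / len(kept) if kept else 0.0
    recall = relevant_kept / relevant_total if relevant_total else 0.0
    return precision, recall


# Invented, hand-labeled examples (True = about the phone company).
mentions = [
    ("Sprint just upgraded my data plan", True),
    ("Sprint Cup race was wild tonight", False),
    ("ugh sprint has no coverage at my gate", True),
    ("did a 5k sprint before work", False),
    ("Switching from Sprint next month", True),
    ("two-week sprint ends friday", False),
]


def capital_only(text):
    """Keep only texts containing the capitalized form (case-sensitive)."""
    return "Sprint" in text


precision, recall = precision_recall(mentions, capital_only)
```

In this toy set the capital-only filter keeps three mentions, two of them relevant, and misses one of the three relevant mentions, so precision and recall both come out at two thirds. Real data will trade the two off differently, as the Sprint numbers above show.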
Conclusion for marketers and data scientists
Whether you’re doing sentiment analysis, named entity extraction, intent-to-buy, or key influencer tracking, you need to make sure what you’re structuring is relevant to your needs. You may want to track swim sprints, tesla coils, and high tides, but you want to do that on purpose, not by accident. Don’t be satisfied with helter-skelter analytics: garbage in, garbage out. There’s a lot of language out there that’s worth understanding and there are tools that work. Drop us a line.
– Tyler Schnoebelen (@TSchnoebelen)