Archive | February, 2014

Finding people in Korean and finding “people” in Korean

28 Feb


Quick, is Popeye a person? The answer really is, “it depends”. And we think that’s the right one.

Named Entity Recognition is one of the basic building blocks of natural language processing. It’s crucial for any kind of sentiment analysis or text analysis because if you aren’t sure whether what you’re talking about is Washington D.C., George Washington, or Washington Memorial Hospital, then your results will be, well, meaningless.

One of our products is “information extraction” to identify People, Locations, Organizations and the like in any amount of text, in any language. If you’re asking us to help you do named entity identification, odds are you want to use it for some specific purpose. So flexibility is important.

This post is about some of the cases in Korean that we came across this week that I think are the most fun.

What’s a person?

First, do you want 뽀빠이 (‘Popeye’) and 스파이더맨 (‘Spider-Man’) or not? If you are trying to track and understand real-world events, probably not (sorry, fans). But if you are trying to understand the ways in which languages express the actions of people, then you probably do want them.


If Spider-Man isn’t a person, is ‘web-sling’ not a verb?

Moving to humans, what about 달라이 라마 (‘Dalai Lama’)? Technically, this is the name of the position held by the head of the Gelug school of Tibetan Buddhism (the phrase is from Mongolian ‘ocean’ plus Tibetan ‘teacher’).

The actual man that we call the Dalai Lama is བསྟན་འཛིན་རྒྱ་མཚོ་ (‘Tenzin Gyatso’), the 14th Dalai Lama. In most contexts, people who are referring to “The President” mean some particular president (in English news, Barack Obama). And in most texts—English or Korean, certainly—the Dalai Lama means a particular individual. But your system needs to be able to discern when you’re talking about a generic role and when you’re talking about a person. Not all languages have capitalization to help (and even in English, we don’t always use capitalization to make a distinction between the president and The President).

The importance of context

Let’s say you want to automatically extract real people. In that case, you probably don’t want to get 현무 (Hyunmoo), one of four legendary gods. But there’s a famous MC whose full name is 전현무 (Jeon Hyunmoo) who can be referred to as 현무.


Do you want gods and creatures to count as “People” in your Named Entity Recognition system? In Korea, there’s a famous MC who shares the name of this legendary god.

Similarly, 탑 could be referring to a ‘pagoda’ or a ‘tower’, but it’s also the name of a famous rapper in the boy band Big Bang.

사야 is the name of a character in the movie Blood: The Last Vampire (Saya). But this also looks like the verb ‘to buy’ when you put it in a ‘must buy’ context. One of the ways you’d want to incorporate “context” is to understand if you’re talking about movies. A simpler linguistic way is that Korean verbs appear at the end of sentences, so if the word 사야 is appearing at the end of a Korean sentence, odds are that the sentence is about shopping and not about a 400-year-old samurai.

Culture, a hot guy and an awesome alphabet

Sometimes what you have to understand is more cultural in nature. 세븐 is the way you represent the English word ‘seven’ in Korean. That is, you pronounce 세븐 as something like ‘sebuen’ in Korean. I bring this up for two reasons. One is to show you how awesome the Hangul (Korean) alphabet is. The other reason is that it’s the name of a famous Korean singer (‘Se7en’ when it appears in Latin characters). I also call reason #2 “Friday eye candy”.


세븐 would like to introduce you to the Hangul alphabet

Back to the alphabet: 세 is a single character that represents a syllable. It’s made up of two parts: ㅅ, which is ‘s’ and ㅔ, which is ‘e’, so together they are ‘se’. Now things get even more exciting. The single character 븐 (‘buen’) is made up of ㅂ for the ‘b’ sound, ㅡ for an ‘eu’ sound AND ALSO THERE’S A ㄴ SMOOSHED IN THERE! That’s what gets the ‘n’ sound. You can see the full list of combinations here. There are 11,172 mathematically possible characters—although as you can imagine all but a couple thousand of those are basically impossible in Korean phonology.
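The smooshing described above is regular enough to compute. Here is a minimal sketch of how Unicode arranges precomposed Hangul syllables: 19 initial consonants times 21 vowels times 28 finals (counting "no final"), which gives the 11,172 figure. The function names `compose` and `decompose` are my own for illustration; the tables use the standalone (compatibility) jamo as they appear in this post.

```python
# Jamo listed in the order Unicode uses for precomposed syllables.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
# 27 final consonants plus "" for syllables with no final.
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
SYLLABLE_BASE = 0xAC00  # code point of the first precomposed syllable, 가


def compose(lead, vowel, tail=""):
    """Build one syllable character from its parts."""
    index = (LEADS.index(lead) * len(VOWELS) + VOWELS.index(vowel)) * len(TAILS)
    return chr(SYLLABLE_BASE + index + TAILS.index(tail))


def decompose(syllable):
    """Split one syllable character back into (lead, vowel, tail)."""
    index = ord(syllable) - SYLLABLE_BASE
    if not 0 <= index < len(LEADS) * len(VOWELS) * len(TAILS):
        raise ValueError("not a precomposed Hangul syllable")
    lead_vowel, tail = divmod(index, len(TAILS))
    lead, vowel = divmod(lead_vowel, len(VOWELS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]
```

Calling `compose("ㅂ", "ㅡ", "ㄴ")` builds 븐 from the three pieces named above, and `decompose` runs the same arithmetic in reverse.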

At any rate, Hangul is such a cool alphabet that it gets a holiday: Hangul Day is on October 9th in South Korea. We like this alphabet so much that Hangul Day is one of our official company holidays. No joke.


– Tyler Schnoebelen (@TSchnoebelen)


English isn’t the only language of the heart

18 Feb


This post is intended to answer the question, “If you’re only paying attention to English, how much are you missing?” Its methods work with approximations but the conclusion is very real: you are missing a lot. I’m thinking about globally-focused CMOs and other folks in Marketing but hoping the lessons are even more widely applicable.

If you have an English-first global strategy, you don’t really have a global strategy. That’s probably already apparent to you. But let’s test whether an English-only strategy might work by multiplying Internet users by GDP per capita. That is, let’s reflect the fact that people in different countries have different purchasing power. To get this, I’ll average the latest per-capita GDP data from the IMF, the World Bank, the CIA Factbook, and the UN. The five biggest there are tiny countries, including Monaco (whoa, $128,501.30 GDP per capita), Liechtenstein, and Luxembourg. If we multiply GDP per capita by the count of people in each country with Internet access, then the top countries are:

  • United States
  • China
  • Japan
  • Germany
  • UK

Odds are that if you’re going or have gone global, then these are among your top markets. Notice that you’re already in the realm of a lot of multilingualism. The US has at least 7 languages other than English with more than a million people speaking them at home: 37m Spanish, 2m French/Cajun, 2m Chinese (not subdivided in the census but there are more Cantonese speakers than Mandarin speakers here), 2m Tagalog, 1m German, 1m Korean, 1m Vietnamese. Russian and Arabic are each close to a million US speakers.

In China, there are 30-some languages with more than a million speakers (the number depends a lot on what you count as a dialect). The biggest are Mandarin, various Wu dialects, and Yue (Cantonese). The World Values Survey attempts to get at a representative sample of people in each country. Of the 3,002 people they interviewed in China most recently, only 10.96% said that they spoke Mandarin at home.

Japan really is Japanese (though it has various dialects); Germany has mostly German dialects and also about 2m Turkish speakers. Like most countries, the UK has a variety of languages but it really is a very English place (place from the Old English plæce and Old French place, ‘open space’, from the Medieval Latin placea from the Latin platēa, ‘broad street’ from the Greek plateia, ‘broad’).

Investing in languages

We can use advertising budgets as a proxy for how much companies are willing to invest in various markets around the world. eMarketer gives information for 22 specific countries and 5 regions. The ratio of ads/Internet GDP ranges from a minimum of 0.58% for “Central and Eastern Europe not including Russia” to a high of 3.81% for Australia. The average is 1.13% and the median is 0.85%.

Let’s use this information to look at what a truly global organization would be like. A truly global organization would try to find opportunities across the world. So I run a bunch of simulations where we vary how big or small the advertising/Internet GDP ratio is (we’ll randomize within one standard deviation of the mean). We’ll multiply that by the actual number of Internet users per country and the per capita GDP for 177 nations.
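For the curious, here is a rough sketch of what one such simulation could look like. It is not the code behind the numbers in this post. The three countries and their figures are made-up placeholders, the standard deviation is an assumption (only the 1.13% mean is quoted above), and drawing the ratio uniformly within one standard deviation is just one reading of "randomize".

```python
import random

# Placeholder inputs, NOT real data: (Internet users, GDP per capita in USD).
COUNTRIES = {
    "A": (200_000_000, 50_000),
    "B": (50_000_000, 20_000),
    "C": (10_000_000, 5_000),
}
RATIO_MEAN = 0.0113  # average ads / Internet GDP ratio quoted above (1.13%)
RATIO_SD = 0.005     # assumed value; the post does not state the standard deviation


def simulate_shares(countries, trials=1000, seed=0):
    """Average each country's share of total ad spend over many random trials."""
    rng = random.Random(seed)
    share_sums = {name: 0.0 for name in countries}
    for _ in range(trials):
        spend = {}
        for name, (users, gdp_per_capita) in countries.items():
            # Draw a ratio within one standard deviation of the mean.
            ratio = rng.uniform(RATIO_MEAN - RATIO_SD, RATIO_MEAN + RATIO_SD)
            spend[name] = ratio * users * gdp_per_capita
        trial_total = sum(spend.values())
        for name, amount in spend.items():
            share_sums[name] += amount / trial_total
    return {name: total / trials for name, total in share_sums.items()}
```

`simulate_shares(COUNTRIES)` returns a dictionary of shares that sum to 1; swapping in real per-country figures is what would turn this toy into an allocation you could act on.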

In that case, 26.37% of advertising spending would be allocated to the US, 9.29% to Japan, 8.29% to China, 5.71% to Germany. As a next step, let’s take the top 50 and look at the major languages (those with at least 1m speakers in each country).

From a language perspective, a truly global company should be investing like this:

  • English 33.49%
  • Japanese 9.53%
  • Spanish 8.44%
  • Mandarin 6.33%
  • German 5.6%
  • French 5.58%
  • Portuguese 2.87%
  • Russian 2.62%
  • Korean 2.41%
  • Italian 2.06%

Arabic would be #11 on the list at 1.72% but this could be revised upwards depending upon how you deal with literacy in Standard Arabic versus one of the many spoken varieties. Depending upon what you’re trying to do, you may be fine with Standard Arabic, or you may need to invest specifically in, say, Colloquial Egyptian Arabic.

Fortune 500 marketing budgets tend to range between 9-17% of revenues for those that make >$10b and between 2-9% for those that make <$10b. Let’s randomize the marketing budgets accordingly. And now look at language investments.

Corporations with >$10b of revenue should probably be spending $431m on English-language projects, $123m on Japanese, and so forth—something like $27m on Italian.

The rest of the Fortune 500, who have <$10b of revenue, would be spending about $77m on English-language projects, $22m for Japanese, but still about $5m for Italian.
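The arithmetic behind those dollar figures is just a budget multiplied by each language's share. A small sketch, using the percentages from the list above; the function name `allocate` and any budget you pass in are illustrative, since the post does not state the total budgets it used.

```python
# Language shares (in percent) copied from the list above.
LANGUAGE_SHARES = {
    "English": 33.49, "Japanese": 9.53, "Spanish": 8.44, "Mandarin": 6.33,
    "German": 5.6, "French": 5.58, "Portuguese": 2.87, "Russian": 2.62,
    "Korean": 2.41, "Italian": 2.06,
}


def allocate(budget, shares=LANGUAGE_SHARES):
    """Split a total marketing budget across languages by percentage share."""
    return {language: budget * percent / 100 for language, percent in shares.items()}
```

As a sanity check, a budget of roughly $1.29b (back-solved from the English figure, not a number given in the post) lands close to the $431m, $123m, and $27m quoted for the larger companies.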

Obviously, the investment in a market has to do with competitive landscapes and other factors, so your individual company may be making particular pushes, whether in media buys, social listening, or other activities. But odds are that you could be doing a lot more to listen to and speak to customers around the world. The listening part, in particular, hasn’t always been possible. So let us know if you’d like to learn more.


The case for even more diversity

One of the things we’re concerned about here is language diversity. (See our posts on Natural Language Processing for All Languages and Machine Learning for Medicine for more on the importance of this; see Economic Powerhouse Languages for more about under-served languages that represent particular opportunities.)

The majority of people in the world are multilingual. If you don’t happen to be, find some people you know who are. Ask them which language they use to pray, to curse, to scold, to love. When they are muttering under their breath, when they are shoring themselves up to do something tough, when they are talking about relationships that matter, when they are recounting a bad day—which languages do they use?

We have lingua franca languages for the workplace but these aren’t necessarily the languages of our hearts. If you want to listen to and communicate with people, the answer may not be the most convenient one. But the fact that it’s not convenient may be exactly why the investment will pay much greater dividends in terms of genuine insights and connections. We can help.

– Tyler Schnoebelen (@TSchnoebelen)


8 Olympic trends in the Russian blogosphere

12 Feb


What’s happening in the Russian blogosphere in terms of the Olympics? This post takes a look at the main themes that emerge from a text analysis that involved topic clustering, named entity recognition, and sentiment analysis of 588 posts to Russian blogs over the weekend. I put these all together to see what emerged as the most significant and interesting trends.

1. Сочи

If you don’t know Russian, you might squint at that word and say “Oh, ‘coyn’”. You obviously didn’t pay attention to the lengthy introduction to the Cyrillic alphabet that kicked off the Opening Ceremony. What is your problem with dream sequences?

“Сочи” is the word for ‘Sochi’, the city on the Black Sea that’s hosting the 2014 Winter Olympics. You know that. A new proper noun to learn is Krasnodar Krai. Despite appearances, that is not the intergalactic ruler that the Cosmonauts encountered in 1961 and just haven’t told us about. It’s the name of the region that Sochi is in.


But back to Сочи. The first letter is es, which comes from the Greek letter sigma (it’s pronounced /s/). It isn’t related to the c we have in English. The next letter is basically an /o/ (like in English no…that’s not quite right but close enough).

Then we get che, which is that /ch/ sound at the start of cha-cha and Tchaikovsky. Finally, it’s a и (sorry, that’s the cursive/italic, it looks like ‘и’ when it’s lower-case print). That’s pronounced like ‘ee’ except sometimes it’s a shorter vowel like in English in. Listen to Russians say this vowel and the whole word here.
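If you like seeing this sort of thing spelled out, the letter-by-letter walk above amounts to a lookup table. This is a toy sketch covering only the four letters in Сочи, written with Unicode escapes so the Cyrillic and Latin look-alikes can't get mixed up; the name `transliterate` is mine, and a real romanization scheme would need the full alphabet and more careful rules.

```python
# Only the four letters discussed above; everything else passes through unchanged.
CYRILLIC_TO_LATIN = {
    "\u0441": "s",   # es
    "\u043e": "o",   # o
    "\u0447": "ch",  # che
    "\u0438": "i",   # i (the 'ee' vowel)
}


def transliterate(word):
    """Replace each known Cyrillic letter with a Latin approximation."""
    pieces = []
    for char in word:
        latin = CYRILLIC_TO_LATIN.get(char.lower(), char)
        # Keep the capital on letters like the initial es in the city name.
        pieces.append(latin.capitalize() if char.isupper() else latin)
    return "".join(pieces)
```

Passing in the Cyrillic spelling of the city gives back the familiar Latin one.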

2. The most popular sports

Here are the sports that get the most mentions:

  • Men’s hockey
  • cross-country skiing
  • biathlon
  • curling
  • snowboarding

Notice, of course, that mentions are a bad proxy for popularity. Snowboarding is mostly mentioned because there were a lot of people reporting on the very first gold medal handed out (to American Sage Kotsenburg for snowboarding slopestyle).


3. Positive for Putin

Most mentions of Putin are explicitly positive or implicitly positive, as with reports that he was happy (or ‘satisfied’) with how the Olympics opened. Here’s a question for your sentiment analysis tool: what does it mean to excerpt parts of news articles? The text within a news article may simply report opinions. Or it may be pretty thoroughly neutral. But what people choose to excerpt—even without comment—does often indicate a particular slant. I will not comment on the slant of the bloggers in this data set who simply excerpt, but I will say that for those who do give metacomments, they are very pro-Putin.

There are some naysayers who might see the entire Olympics as Putin’s диктаторские игрища. That translates to something like ‘dictatorial merrymaking’ (Russian speakers, your suggestions are welcome!). The first word really has the same root as English dictator (from Latin). If you look at the second word, the first three letters are ‘games’; in fact, ‘games’ in The Winter/Olympic Games is “игр”.

4. Spectacles and evil hamsters

One of the largest groups of blog posts is about loving the Opening Ceremonies. My favorite comment is from someone who writes:


That first word is ‘it/this/that’. The second word is something like ‘impossible’ or ‘unreal(istic)’. The third word is related to the noun for ‘magnificence’ and ‘splendor’. But in one of the all-time great machine translations, Google Translate suggests: ‘THIS IS CRAZY GORGEOUS!!!’

I’ll talk a bit more about this in the next sections, but people who don’t agree with the awesomeness of the Opening Ceremonies are derided. In one case, they are called злобное хомячье, which is basically ‘evil hamster’. Or maybe ‘vicious hamster’.

5. Negative for foreign press

There are two ways the foreign press is cited (and both of these are prominent themes). One is to recount how amazed the foreign journalists are. The other is to talk about how they are evil hamsters (see #4 above).

This dovetails nicely with another prominent theme—the idea of information warfare against Russia/the Olympics (my English is a pretty direct translation of “Информационная война против Олимпиады”). That is, #SochiProblems and other kinds of critiques are seen as an orchestrated, anti-Russia attack.

You get people translating @the_cheshirekat’s tweet “Dystopia has never looked so fabulous”. That’s «Антиутопия еще никогда не выглядела так сказочно» in Russian. Here you get to see that the Russian word for ‘dystopia’ is built from slightly different Greek pieces than the one we use in English: the Russian term is ‘anti-utopia’. The word for ‘fabulous’ has as its root ‘fairy tale’. (The English term comes from the French fabuleux, which has its origins in ‘fables’.) Btw, this.

One of the things you may have seen going around in the English-language press is the Russian word for Schadenfreude (although there may be some different nuances). The Russian term is злорадство, ‘malicious glee’ (zloradstvo, ‘evil’+’joy’). The Russian blogs don’t actually talk about this very much.

Of the four mentions of this ‘malicious glee’, one is annoyed about it in terms of the Olympics, one traces where it comes from, one is actually talking about US Republicans gloating, and the fourth is from a blogger who won’t post pictures of miserable people (in Ukraine) because they don’t want to participate in zloradstvo.

One of the bloggers talking about #SochiProblems has a pretty interesting critique. They say that rather than being trivial, #SochiProblems actually show the consequences of corruption so that what seem like inconveniences are actually just the surfacing of something much darker. But this particular Russian critique of Russia wasn’t that common.

Putin’s own counter-critique is:

Ложечку дегтя в бочку меда подлить есть всегда желающие

‘There are always those who want to pour a spoonful of tar into a barrel of honey.’

That first word is particularly interesting because it ends in a diminutive—so it’s more like ‘a little spoonful’, maybe even ‘a cute spoonful’, embedding an extra bit of dismissiveness. Diminutive suffixes are common in the world’s languages (think of Spanish -ito and Italian -ino). Because they are often associated with children, they can be quite affectionate or be useful in hedging a request. But they can also take on other shades. Consider going up to someone in English and asking, “How’s your little project?” about something they’ve been spending a bunch of time on.

6. Not much about the gays

There are folks who mention boycotts but they don’t always mention LGBT issues. A popular strategy when gay rights are mentioned is to point out how hypocritical Americans are since there are states with anti-gay laws here. But in general, there isn’t much focus on them. A few phrases for your phrasebook, though:

  • содомской пропаганды ‘propaganda of Sodom’
  • гей-пропаганду ‘gay propaganda’

The word гей is also a homophone for what is basically an exclamative ‘Hey!’, like “Гей, славяне!” (‘Hey, Slavs!’)

7. Medvedev lullaby

One of the most popular (but short) postings was about Russian Prime Minister Dmitry Medvedev sleeping during the ceremonies. A typical headline: МЕДВЕДЕВ ВПАЛ В СПЯЧКУ, ‘Medvedev went/fell into hibernation’. As Benjamin Lukoff points out, the Prime Minister’s last name means ‘bear’ (even more fun, it’s literally ‘honey+eater’).


8. Performances

The artists performing at the Olympics are a common topic. The word “Артисты” (the Latin characters are something like ‘artisty’) includes artists, actors, and entertainers.

One of the most frequent people blogged about was the Russian rock singer Zemfira, who said a song of hers was used without her consent during the Opening Ceremonies, in violation of her copyright.

The term is also used for captions of pictures of the performers at the Opening Ceremony. We’ve spoken of that some, but let’s take our hats off for the hats of Russia, whether they are the ones below or ushankas or Afghankas. We clearly all need more puppy dog ear hats as in the back row. Agreed?


Towards a conclusion

People who know English but not Russian still know a number of Russian words. But these words weren’t used very often this weekend.

The only word that you probably know that was used a lot was нет (‘nyet’, ‘no’). But there are just a handful of occurrences in this dataset for ‘babushka’, ‘gulag’ (which btw is an acronym in Russian), ‘pogrom’, ‘samovar’, and ‘perestroika’.

Broadway doesn’t make any appearances, but here’s a quote that may resonate for Russians and/or observers, whether you’re thinking of the personal or the geopolitical. From Tony Kushner’s Angels in America, Part Two, which was named Perestroika: “In this world, there is a kind of painful progress. Longing for what we’ve left behind, and dreaming ahead.”

Understanding language is ultimately about understanding people. If you’d like to learn more about advanced text analytics, drop us a line.

– Tyler Schnoebelen (@TSchnoebelen)


Crazy good: More nuanced sentiment analysis

7 Feb


The word “crazy” is one of the most flexible in English. It can be an intensifier, as in crazy good/crazy bad, and it can be positive or negative when standing alone: this party is crazy! versus these demands are crazy. There is often a pejorative use associated with mental illness, so it is a sensitive and sometimes offensive word by association. In this post, we are looking at its grammatical context and how that contributes to sentiment.

At the end of the post, you’ll also see a bit about how we automatically detect and remove phishing messages and other spam.

So what do people on social media think is crazy? Mostly events. More specifically: movies, sports, life, women, and Kevin Durant.


The unexpected

The unexpected gets attention—this is a pretty basic truth of cognition and culture.

But we have various kinds of reactions. There’s the unexpected that fills us with excitement and there’s the kind that we reel away from or that we use to socially patrol others and ourselves. In other words, there’s crazy-fun and crazy-unacceptable.

The wrinkle is that even crazy-normatively-objectionable can inspire titillation that we kinda like. Here are other emotional cues that introduce someone saying something is crazy in social media: holy shit, man, lol, damn, ohh shitttt, smh (‘shaking my head’), lmao, omg. Just because we’re communicating an intense reaction doesn’t mean we actually know what we think about it. Pure emotional states are rare.

Crazy is a good example of a word with a complicated social signal. A major motivation for this post is that there’s a lot more to sentiment than positive/negative/neutral.

What does “crazy” mean?

For people concerned about the stigmatization of mental illness, some good news: craziness terms are applied to non-humans 3.47 times more often than to people—mostly to events and situations. (Although this is also the case for pejorative gay, which is also applied to situations—that’s gay—so this may be a limited consolation.)

Having said that, there are 1.66 times as many references to women being crazy as men (for example, mom, girl, she vs. he, guy, dude). This is a long-standing imbalance: just look at the etymology of hysteria. Back to gender in a minute.

When someone or something is crazy they are unintelligible. In speech, it could be a failure to take listeners’ needs into account by dropping reference cues that are necessary to follow (“Wait, you’re using a pronoun but you haven’t introduced the referent!”). Crazy talk is also where you say things that aren’t socially sanctioned, like a soldier giving an order to his commanding officer or an invisible penguin. Disrupting the established social order can get you labeled crazy.

Sometimes doing the unacceptable is good. Sometimes not. The main movie that people said was crazy was the Lifetime remake of Flowers in the Attic. It features Ellen Burstyn and Heather Graham locking kids in an attic. The story gets messier.

Craziness also indicates that there’s no reasoning possible. In Apache culture, you tend to stay quiet when someone is enraged (hashkee) because they are crazy (bíni’édi̜h): they forget who they are and lose concern for what their actions do. Odds are you weren’t raised Apache, but this will still sound familiar.

The connection between (un)reasonability and gender is doing a lot of work. A lot of the sexist messages give no real content other than This bitch is crazy. A fair number of writers specify that they love this fact but more are probably using it as a critique. Most instances don’t have enough context surrounding them to actually let us tell what the authors meant. They may not know themselves.

The majority of women who are labeled as crazy are left as relatively anonymous. The men labeled as crazy are more specific: Kevin Durant, John Tortorella and Dynamo the Magician, in particular. During the January time period I grabbed this data from, Durant scored 54 points against the Golden State Warriors, a career high for a guy who is a phenomenal basketball player even without that game. His scoring was exceptional (unexpected), which led lots of people to say Durant is crazy. The ways we describe situations and actions can pretty easily sneak their way into the way we describe people.

Scrubbing spam, singing sexy

The first worry of a data scientist is “garbage in, garbage out”. Hence the importance of data janitor functions. One step I took for this analysis was to restrict it to users whose follower and propensity-to-be-retweeted counts were within two standard deviations of the average. That reflects an assumption that people outside of that range deserve a different kind of analysis. At one end, they are new users and spammers; at the other end, hyperpopular celebrities and news outlets.
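Here is a minimal sketch of that two-standard-deviation filter. The field names `followers` and `retweet_rate` are hypothetical stand-ins for the two counts mentioned above, and using the population standard deviation is my assumption; the post does not say which variant was used.

```python
from statistics import mean, pstdev


def within_two_sd(users, fields=("followers", "retweet_rate")):
    """Keep users whose value for every field is within 2 SD of that field's mean."""
    bounds = {}
    for field in fields:
        values = [user[field] for user in users]
        center, spread = mean(values), pstdev(values)
        bounds[field] = (center - 2 * spread, center + 2 * spread)
    return [
        user for user in users
        if all(low <= user[field] <= high for field, (low, high) in bounds.items())
    ]
```

One design note: the bounds are computed once from the full set, so dropping an outlier does not shift the mean for everyone else. Whether to re-run the filter after trimming is a judgment call.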

But dropping the extreme ends doesn’t eliminate a specific kind of spam: phishing. These are the messages that are sent out by normal users when their accounts have been compromised. For example, there were over 98,000 messages of the form, “@phishingvictimfriend haha your blog is crazy http://evilurl”.

In our system, we do unsupervised clustering in order to group messages that are the most alike together. Those 98k crazy-blog spam messages have a variety of users addressed with the @ and a multiplicity of URLs, but they end up all grouping together. In this case, there are about 4 major spam clusters worth removing from consideration (so they aren’t reported above).
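To give a feel for why those messages fall together, here is a much simpler stand-in than the unsupervised clustering our system actually does: mask the parts that vary (the @-mention and the URL) and group whatever is left. The function names and placeholder tokens are made up for this sketch, and exact-template grouping will miss the near-duplicates that real clustering catches.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def template(message):
    """Mask mentions and URLs, lowercase, and collapse whitespace."""
    text = URL.sub("<url>", message)      # mask URLs first so any '@' inside them is gone
    text = MENTION.sub("@user", text)
    return " ".join(text.lower().split())


def group_by_template(messages):
    """Map each masked template to the list of original messages that share it."""
    groups = defaultdict(list)
    for message in messages:
        groups[template(message)].append(message)
    return dict(groups)
```

Sorting the groups by size puts the big spam templates at the top, which is where you would start when deciding what to drop.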

Automatic clustering gives you a sense of the data you’re dealing with, something that word clouds are fairly pathetic at. Clustering techniques allow you to get exemplars, the most representative messages for each cluster. This is where you can also see if there are problem zones like phishing and where you can see which messages are enjoying the widest circulation.

A huge number of the bitches is crazy posts are referencing a lyric from Lil Durk’s Bang Bros, which was released last October (I’d include a link to the YouTube video but it’s really dull and not worth watching). Song lyrics stick around for a while: folks are still tweeting out “This shit is bananas, B-A-N-A-N-A-S”, which comes from Gwen Stefani’s 2005 Hollaback Girl. That’s a song where she tells some dude to meet her at the bleachers for a fight. Watch it, Lil Durk.

Analyses of social media often need to decide what to do with song lyrics. The right way to do it is to ask the client what they want. For example, Granger Smith’s country song has the lyrics, “I wanna love you on a Silverado bench seat”. When people tweet that, should it count as positive sentiment? Neutral? A case could be made for either.

Sexuality is an important component for many brands. In our work, we’ve found that a huge part of how cars and trucks are evaluated in social media has to do with their sexiness or cuteness, including the sexiness/cuteness of drivers. So brand managers for Chevy and competitors probably do want to keep track of this sort of thing. (Product aside: we offer automatic detection of sexy-cute and lots of other dimensions like intent-to-buy that are more fine-grained than traditional sentiment analysis; check out the second row of our product page.)

To be honest, being able to decide this is kind of a luxury. This is not the kind of feature sentiment analysis tools typically have. They’ll just automatically count that lyric about the Silverado as positive (mostly because of the word love) but they’ll also count completely irrelevant Silverados like “I have to go to silverado tomorrow I might kill myself ??¿” That’s negative about Silverado, California, not the Chevy Silverado. Drop us a line if you’re interested in learning more.

– Tyler Schnoebelen (@TSchnoebelen)