Wholesomeness, the whole picture

28 Apr


Words and cultures are dynamic—they change over time. Even something as basic as family. Family originally referred to the servants of a household, then an entourage that went around with a high-ranking person, then a whole household (parents, children, servants).

Marketing campaigns are meant to tie companies to cultural and personal values. In doing so, they reflect what various keywords have come to mean and they are part of how those words themselves change. A good example of this is Honey Maid’s recent campaign around “This Is Wholesome”, which features a two-dad family, a rocker family, a single dad, an interracial family, and a military family. The first video launched on March 10, 2014 (6,370,863 views on YouTube as of this writing).

There are several interesting dimensions to this campaign: (1) What is the meaning of wholesome and how has it changed? (2) Since this was a deliberate marketing campaign, how can we measure if it was effective for Honey Maid? And of course (3) what are the awesomest other words that end in -some?

As you can imagine, applying wholesome to gay dads generated a lot of conversation and a fair amount of hate mail. A month after the original video, Honey Maid launched a follow-up that reaffirmed their position (3,785,968 views on YouTube).

Measuring effectiveness

It’s easy enough to boil down campaign effectiveness for Honey Maid: do people buy more graham crackers because of these videos?

But we can spell this out a little more. The basic thinking required is “counterfactual”. In other words, “if we hadn’t run this ad, would people have bought more/less/the same”? That requires having a handle on the various factors that contribute to purchases so you can see how much has to do with the ad. So measuring an ad’s effectiveness has a lot to do with measuring everything else going on. These factors are rather proprietary and because of the always-on consumer, “attribution” (which event led a consumer to actually buy something) is pretty tricky.

Because attribution is complicated, a lot of marketing departments use proxies, for example “brand awareness”. You can’t buy a product you’ve never heard about. An ad that gets people to learn about or remember a product is seen as a crucial step. It’s measurable by surveying people or by seeing how mentions in social media change. Of course, if everyone talking about your product finds it loathsome, you presume that sales will be negatively impacted. So some kind of sentiment analysis is usually necessary.

You can get fancier and detect intent-to-buy messages and track how those rates change. There are a lot of other Key Performance Indicators that marketing departments use. Rather than insist upon a single correct measure, I’ll suggest that the key is having measures that actually measure what you think you’re measuring. If you think you’re measuring sentiment but your accuracy is terrible, you’re not really measuring sentiment. If you’re only paying attention to English, you’re not really treating your brand as a global brand (English-only is even a problem for American brands: English isn’t the only language of the heart).

What’s a brand?

A brand is a relationship between a company and its customers. From the company’s side, the brand is promising specific benefits and value: “we’ll protect you,” “we’ll make new things possible”, “we’ll keep you healthy”, “we’ll show you a good time”.

A company that has a brand relationship with a consumer means that the consumer has some kind of emotional investment. If the consumer believes the company is faring well, they’ll have positive feelings; if the consumer believes the company is having troubles, they’ll feel negative emotions. Giving people the opportunity to care is part of creating a relationship. And caring is not always just joy-when-someone-else-is-joyful.

People who have seen Honey Maid’s ads understand that there are those who will be upset about showing a family made up of two dads. They know Honey Maid has taken a position. The ad campaign gives the opportunity to solidify a relationship with these people.

The other part of a relationship is that it’s mutual. You’re not in a relationship if it’s one-sided. So a company only has a brand if they are valued by consumers and if the company values its consumers. Another wrinkle: it’s not a relationship if either side just sees the other as “a means to an end”. That is to say, there has to be respect. When people talk about brands needing to stand for something, this is what they’re talking about. 

It’s not straightforward to quantify taking a stand. Measuring how many more discussions happen after a company Takes-a-Position is a good first pass: a dramatic increase in mentions indicates that the company has hit upon something. But this could just be “We’re giving away a free car with every retweet!”.

A principle is only a principle if it costs you something. For that reason, taking a stand means attracting not just more mentions, but mentions that are very-positive and very-negative. Taking a position can increase respect exactly because people believe you are willing to pay a price for a principle. From a financial perspective, a company hopes that the very-positive is made up of more consumers/possible consumers than the very-negative. But taking a position means taking a risk. If everyone knew the outcome beforehand, it wouldn’t really be a matter of Taking-a-Position.

If you’re actually Taking-a-Position, of course, you’re not doing something merely to make money. That means another measure of effectiveness is going to be whether you actually change the discourse around some topic. In a moment, I’ll talk about how the meaning of wholesome has been shifting but first, let’s look at what’s going on with Honey Maid.

What’s happening with Honey Maid

Honey Maid’s follow-up campaign (spoiler alert) mentions that they got ten times as many supportive messages as critical messages. How do things currently stand three weeks later?

Recent social media posts are overwhelmingly positive about Honey Maid (I’m reporting results for a bunch of different spellings of the brand though there are surprisingly few “Honey Made” misspells). About 43% of Honey Maid mentions are about the commercials. And of these about 89% of them are positive. The items that aren’t positive are largely neutral/newsy in tone.

There are very few negative items. The haters have essentially moved on, while supporters continue to circulate the positive message. Mostly people are sharing the video and expressing support. About 7% of positive mentions right now talk about it in terms of eating more Honey Maid—whether this is good really depends on what the typical response to ads is (both for Honey Maid and their competitors).

A history of wholesomeness

The main line in the original commercial is:

“No matter how things change, what makes us wholesome never will. Honey Maid. Everyday wholesome snacks for every wholesome family. This is wholesome.”

Honey Maid’s campaigns look likely to be effective from a marketing standpoint (we need to add some more variables to really show this). Is it effective in doing something about the discourse around wholesomeness? As with branding, we need to understand the history to understand change. Without a historical perspective, there’s really no way to answer questions of effectiveness.

The first uses of wholesome tended to be about ‘virtuous teachings’. In Wycliffe’s Bible way back in 1382,

The..holsum wordis of oure Lord Jhesu Crist. (1 Timothy 6:3)

(Modern versions treat wordis as ‘words’, ‘teachings’, or ‘instructions’.)

In other words, religious communities have been using wholesome as part of their discourse for hundreds of years. That’s what’s behind their notion that Honey Maid was trying to redefine wholesome (and family). But of course no one really owns the definitions of words and it is the nature of words to shift over time.

Importantly, the religious meaning really hasn’t been the single or primary meaning for most of the history of the word. One of the major uses of wholesome is about food/medicine: Chaucer talked about holsome herbes back in 1372. In America, the early 1800s were especially a heyday for talk about wholesome foods. You see an uptick in talking about wholesome recreation/activities in the 1900s and we’ll see this in the contemporary social media, too.


Note that across web pages in countries where English is used, the virtue-related meaning isn’t very prevalent. The one exception is Sri Lanka, where wholesome is much more popular than anywhere else. In Sri Lanka, it is usually used as a translation for the Buddhist concept of kusala.

Okay, in a moment, we’ll trace what’s happening today in social media, but let’s go in for a little etymology. The whole part started out in English meaning ‘healthy’ in the sense of ‘uninjured’—originally you could only talk about people and animals being whole. The -some suffix has to do with turning other words into adjectives. Digression!


Really old forms:

  • friðsum (peaceful)
  • genyhtsum (abundant)
  • ánsum (whole)
  • langsum (lasting)
  • hýrsum, héarsum (obedient)
  • wlatsome (detestable)

That last one isn’t quite as old as the others. It dates back to the 1300s, when wlat meant ‘nausea’. Basically this is me trying to jump start wlatsome. As in, “Don’t click on this link about the wlatsome Kevin Swanson.”

Things we’ve kept on since really early English:

  • winsome
  • cumbersome
  • fulsome
  • handsome
  • loathsome
  • buxom (it’s hiding in the x!)

And from the 16th century, we get awesome and quarrelsome. After that we get adventuresome, bothersome, fearsome, and lonesome.

Wholesome and social media

Right now, a lot of the mentions of wholesome are for Wholesome Radio (36% of mentions of wholesome are in their tweets). Let’s remove these and the spam (about 11% of the data).

  • Food: 23% (but mostly not about Honey Maid)
  • Humans: 23% (and how they can/should live; church-related mentions are prominent)
  • Entertainment: 13% (movies, TV)

Now let’s compare this to 2011 uses:

  • Humans: 32%
  • Entertainment: 12%
  • Food: 9%

The data sets are not perfectly equivalent—the 2011 sample focuses on American English users who use social media conversationally, so the differences could be time or they could be that people who tend to be broadcasters/curators are more likely to talk about (wholesome) foods than people who mostly use social media to converse.

Complicated meanings

Now for something that cuts across categories: jokes, contrasts and irony. About 17% of the current mentions of wholesome are not, strictly speaking, about wholesomeness:

  • Ironic:
    • Thank you, Victoria’s Secret, for ruining what would otherwise be a wholesome picture. (Snooort!) #WhatIsHerSecret
    • Like a good wholesome strip club would put me in the most perfect of places.
  • Not-exactly-ironic:
    • Scary movie is over and to calm down we are watching something wholesome like X-Files season 6.
    • Why would anyone “unfollow” me? Got something against wholesome? Got me this far so buggar off! Golly. I feel so dirty now.
  • Contrastive:
    • @User I can see Rene doing it now! My DMs are all wholesome. Except you 😛
    • After wholesome scripts that I’ve written for school and other projects, I just can’t believe I’m writing these kind of lines. Hahaha!

In the 2011 data, the percentage is more like 46%. Again, the change may not be that we’ve gotten less ironic but that intensely-conversational social media users are simply more ironic/contrastive.

Regardless, the point is that modern discourse about wholesome has a very strong thread that plays with the concept instead of taking it at face value.

Metaphoric s’mores conclusion

S’mores (short for some more) used to be called heavenly crisps. The recipe is simple: Graham crackers give structure to an interior of chocolate and gooey marshmallow. Structure and goo. That’s basically meaning.

The structure of our languages and our social lives *seems* to come from “on high”…but really that structure is built out of individual actions. This is a squooshy, sticky notion sometimes called duality: your choices are defined by structures outside of you, but those big stiff-seeming structures only exist because of a billion individual choices.

The meanings of wholesome in the present moment come from a history of people using it day-by-day, year-by-year. The future of wholesome is the same. Patrolling word use is an attempt at language control in order to control thoughts and ideas. Basically, it doesn’t work. There’s structure around the gooeyness but think of what happens when you press too firmly on it. Is a melted marshmallow well-behaved? A melted marshmallow is not well-behaved.

The reason you want to structure unstructured text is because you want to understand patterns. Patterns are useful because they describe what is and help predict what might come. And while insisting on control is a losing proposition, shaping is possible. Our individual choices reflect what is. And they are part of making it so.

Earlier I defined brands as a kind of relationship. Relationships also involve historical patterns of actions and beliefs. But more than that, they involve a kind of meta-approval: a kind of normative desire that says, “Yeah, I don’t just have a history of these actions and beliefs—I believe in them.” We have all sorts of habits and beliefs about various people and organizations in our lives, but the real relationships are the ones we would endorse.

Cultural keywords are disputed terms. By staking a claim about the environment, privacy, family, freedom and the like, companies position themselves. Consumers and potential consumers will align or depart, but it is the very act of staking a claim that makes it possible to have meaningful relationships. A relationship that you could literally have with anyone is not really a relationship at all. Floating in space, unattached, can save you the pain of rejection. And you can be well-regarded from afar. But relationships risk, require, and reward more.

Name your next product Xygrzflyt

17 Apr


Turn these words around in your mouth: fiesta, polo, eon, jazz, rio, rush. They are all great words. They are also the names of some of the top-selling car models from Ford, Volkswagen, Hyundai, Honda, Kia, and Toyota. On the one hand: cool. On the other hand, ARGH.

If you think of your organization’s name, its products, its services, and its features, odds are at least some of them are common words (unless you’re in pharmaceuticals). But while Honda brand managers may really like Sonny Rollins’ “Saxophone Colossus”, they don’t usually want to be tracking its popularity when they are looking at social media or other communications. But even in Japanese, jazz = jazz = ジャズ = ジャズ.  

It’s the nature of words to be ambiguous—when Sprint actually is about the telecommunications company, it’s still sometimes about services and sometimes about the corporate entity. If you’re doing something like sentiment analysis, you might lump them together, but you might want to keep them separate just as you might want to keep paid/owned mentions (by customer support and corporate brand managers) separate from organic items (from everyday folks who aren’t paid to talk about Sprint).

The key is to be able to carve up the world into categories that are meaningful. It’s only when categories are meaningful that you can get insights and take actions. Disambiguation is a crucial requirement for understanding.

Ambiguity, relevance and 3 big brands

In this blog post, we take three common words and show how important relevance is: Sprint, Tesla, and Procter & Gamble’s detergent, Tide. I’ve focused on English and already removed spam. (Here’s some recent stuff on French and Spanish, Korean, Russian, and some non-English punctuation marks you need to start using.)

Sprint for the door if Tesla

For Tesla, the proportion of tweets having to do with the car company and its vehicles has held pretty consistent. The rate of mentioning Nikola Tesla and tesla coils is relatively stable.

The relevance rates for Tide and Sprint go up and down with the football season (the Crimson Tide at Alabama) and work/exercise cycles (software sprints but more often, running/races).

Another reason why you need an adaptable system: you have to figure out whether news about, say, the NASCAR Sprint Cup counts as having to do with Sprint or not. The company is investing in the competition but when someone is excited about Joey Logano winning, how relevant is it to brand managers? It’s a matter of taste and what business questions you want to answer. It’s irrelevant if you just want to know how people feel about network coverage but it’s relevant if you’re tracking marketing campaign reach. Defining what counts and having a tool that can handle your custom definitions is important.

It’s also worth noting that something like sentiment isn’t randomly distributed. The bulk of irrelevant Tesla mentions are about inventor/scientist Nikola Tesla. Tweets about Tesla cars are fairly positive but if you don’t disambiguate and get rid of Nikola Tesla references, you’ll end up with a sense that people are far more positive about the cars than they are. People who talk about Nikola Tesla almost always love him. Mustache.

Capital-S people and Lowercase-S people

In the case of Sprint and Tide, you could say that people who capitalize are more likely talking about the right thing. For Sprint, relevance goes up to 73.80% when you just use the capitalized form, while only 43.56% of people mentioning sprint are talking about the company and its phone services. But if you restricted yourself to only the capital, you’d be missing a lot of data: you’d be getting only 63.97% of all the conversations you want if you ignore lower-case sprint.
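To make the two kinds of percentages concrete, here is a minimal sketch of how relevance (how much of what you collect is about the brand) and coverage (how much of the brand conversation you keep) can be computed from labeled counts. The class name, function names, and the counts themselves are hypothetical; the counts were only picked so the results land near the percentages quoted above, and they are not the data behind this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpellingCounts:
    total: int     # every mention that uses this spelling
    relevant: int  # mentions that are really about the brand


def relevance(counts: SpellingCounts) -> float:
    """Share of mentions with this spelling that are about the brand."""
    return counts.relevant / counts.total


def coverage(kept: SpellingCounts, dropped: SpellingCounts) -> float:
    """Share of all relevant mentions you keep if you only look at `kept`."""
    return kept.relevant / (kept.relevant + dropped.relevant)


# Hypothetical counts, chosen to roughly echo the percentages in the text.
capital_s = SpellingCounts(total=1000, relevant=738)
lower_s = SpellingCounts(total=954, relevant=416)

print(f"Sprint relevance: {relevance(capital_s):.1%}")
print(f"sprint relevance: {relevance(lower_s):.1%}")
print(f"coverage with capital-S only: {coverage(capital_s, lower_s):.1%}")
```

The point of keeping the two functions separate is that a filter can look great on relevance while quietly throwing away a third of the conversation you care about.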

The real problem is that capitalization conventions are not randomly distributed across people. That is to say, the types of people you get talking about Sprint are different than the type talking about the same company/services but referring to them as sprint.

The people who use the capital-S tend to be active in technology: they keep blogs, talk about science/gadgets, a lot of them identify as husbands/fathers. The lower-S users who talk about sprint tend to be younger, swear more and talk more about travel and sports. If you’re a brand manager who only cares about developers and geeks, then you’re fine just looking for people who use proper Sprint capitalization. But you’re going to be missing a lot of data and a lot of the variety of perspectives…which is the whole point of tapping into social media.

Conclusion for marketers and data scientists

Whether you’re doing sentiment analysis, Named Entity Extraction, intent-to-buy or key influencer tracking, you need to make sure what you’re structuring is relevant to your needs. You may want to track swim sprints, tesla coils, and high tides but you want to do that on purpose, not by accident. Don’t be satisfied with helter-skelter analytics: garbage in, garbage out. There’s a lot of language out there that’s worth understanding and there are tools that work. Drop us a line.

Social media more complex than great literature

11 Apr

{Originally written for}

The Washington Post recently published an article about online/offline reading differences. Are our reading abilities changing?

“They cannot read ‘Middlemarch.’ They cannot read William James or Henry James,” Wolf said. “I can’t tell you how many people have written to me about this phenomenon. The students no longer will or are perhaps incapable of dealing with the convoluted syntax and construction of George Eliot and Henry James.”

So this post tries to answer the question:

Is social media harder or easier to read than Henry James and George Eliot?

Now to refresh your memory, here’s the opening of George Eliot’s Middlemarch:

Who that cares much to know the history of man, and how the mysterious mixture behaves under the varying experiments of Time, has not dwelt, at least briefly, on the life of Saint Theresa, has not smiled with some gentleness at the thought of the little girl walking forth one morning hand-in-hand with her still smaller brother, to go and seek martyrdom in the country of the Moors?

I mean, just yesterday.

Syntax is dead and also very interesting

A dirty secret of computational linguistics is similar to the WaPo article: “syntax is dead”. Except what computational linguists mean isn’t that syntax is really dead but that we avoid actual syntactic parsing in our machine learning models, getting the same information extraction accuracy with much simpler features, like sequences of words, clusters of words and phrases, and their relative positions in sentences.

But that doesn’t mean there isn’t a lot of insight to gain from thinking syntactically. I’m not going to directly quote the opening of Henry James’ The Ambassadors but let me recommend Ian Watt’s 1960 essay about its first paragraph. The Ambassadors has a boring beginning, which even Watt grants. But under his careful explication you come alive to the subtle humor and the way the author has knitted together ambivalences and bewilderment. Importantly, Watt mentions the difficulty of Henry James three times. That’s a literary critic, professor of English at Berkeley (later Stanford) saying James was hard. In 1960, long before the Galaxy, the iPad, and the Kindle.

One of the reasons for the special demand James’s fictional prose makes on our attention is surely that there are always at least three levels of development—all of them subjective: the characters’ awareness of events: the narrator’s seeing of them; and our own trailing perception of the relation between these two.

Of course students today have difficulty with Henry James. People have always had difficulty with Henry James.

Watt’s essay is famous in literature circles because it is such a careful textual analysis. From a linguist’s perspective, it is rich with hypotheses of how language works. Watt’s main points are that James is being humorous as he introduces his protagonist but also compassionate. Here are the linguistic resources he mentions:

  • Sentence lengths and how they change
  • Delayed specification of referents
  • A preference for abstraction: non-transitive verbs, abstract nouns
  • An enormous use of that (see also Who is the Sarah Palin of the Canterbury Tales?)
  • Variation to avoid piling up personal pronouns
  • Lots of negatives and near-negatives
  • Odd placement of words
  • A density of parentheticals (the last sentence of the first paragraph, in particular)

In general, if someone says their system can automatically detect irony or sarcasm, you should grab your wallet and run. But Watt has an interesting observation (in a similar spirit to Maynard Mack in 1948): “The application of abstract diction to particular persons always tends towards irony, because it imposes a dual way of looking at them: few of us can survive being presented as general representatives of humanity.” As he later develops, the classic posture of irony is to know something about someone that is more than they know about themselves.

In general, this suggests the importance of tracking “distancing devices” in language, or more generally, the way people use language to position themselves, their topics, and their audiences.

Tweets are harder to read than George Eliot

Syntactic parsing is a good way of measuring complexity but it’s difficult to get it to be accurate and is especially unreliable for tweets (the top compling parsers are listed here; for Twitter, the best resource is CMU’s Twitter part-of-speech tagger).

Psycholinguists have long known that long strings of long words are hard for people to read. So rather than operationalize “difficult syntax”, I’ll take an easier route and use the Flesch Reading Ease metric of readability. Idibon also has other metrics of readability, standard ones and ones that take into consideration word frequency since common words are easier to read than uncommon ones, and some machine-learning driven readability metrics of our own. But for this post, we’ll stick to Flesch so that you can repeat our experiment at home.
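If you do want to repeat the experiment at home, here is a rough sketch of Flesch Reading Ease for a single sentence. The formula is the standard one (206.835 minus 1.015 times words per sentence, minus 84.6 times syllables per word). The syllable counter is my own crude assumption: it counts vowel groups and drops a silent final "e", so it will miscount plenty of words, and it is not the counter used for the results in this post.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_syllables(word: str) -> int:
    """Very rough heuristic: one syllable per group of vowels."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return 0
    groups = len(re.findall(r"[aeiouy]+", letters))
    # Treat a trailing "e" as silent ("sure"), but keep "-le" ("impossible").
    if letters.endswith("e") and not letters.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def flesch_reading_ease(sentence: str) -> float:
    """Flesch Reading Ease for one sentence (higher means easier)."""
    words = WORD_RE.findall(sentence)
    if not words:
        raise ValueError("no words to score")
    syllables = sum(count_syllables(w) for w in words)
    # With a single sentence, words-per-sentence is just the word count.
    return 206.835 - 1.015 * len(words) - 84.6 * (syllables / len(words))


for text in ["Sure.", "Impossible!", "124 was spiteful."]:
    print(f"{flesch_reading_ease(text):8.2f}  {text}")
```

Scoring sentence by sentence, rather than over a whole text, is a choice that makes very short sentences swing wildly, since a single long word dominates the syllables-per-word term.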

I picked out the 10 works of fiction that are the most popular on Project Gutenberg right now, added 5 of the most popular Henry James and 5 of the most popular George Eliot. I also grabbed 400,000 random tweets and segmented them into sentences (a tweet lacking any punctuation was considered a full sentence).
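For the tweet segmentation, something as simple as the following sketch captures the rule described above: split after sentence-ending punctuation, and treat a tweet with no punctuation as one sentence. The function name and the exact regular expression are assumptions for illustration; the actual segmenter may have handled abbreviations, emoticons, and URLs differently.

```python
import re

# Split on whitespace that follows ".", "!" or "?".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(tweet: str) -> list[str]:
    """Split a tweet into sentences; no punctuation means one sentence."""
    parts = _BOUNDARY.split(tweet.strip())
    return [part for part in parts if part]


print(split_sentences("Scary movie is over. Now something wholesome!"))
print(split_sentences("omg so tired"))
```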

This project also had me reflecting on great beginnings, like Toni Morrison’s beautiful (go read it now) Beloved:

124 was spiteful. Full of a baby’s venom.

How do you make a great beginning? “Make the subject of the sentence an obscure sequence of numbers to get the reader’s attention. In case that doesn’t work, follow up with a terrifying, baby-related metaphor.” (More advice here.)

To handle this, I grabbed the opening sentences from 837 different novels. Note that I was strict so even though a lot of beginnings require an interplay of first and second sentences, by beginnings I really mean one sentence and one only.  A more comprehensive study would look at the relationship between lengths, as Watt mentioned (see above).

Back to algorithms. It’s important to notice that a one-word sentence like “Impossible!” gets a terrible Reading Ease score because it’s got so many syllables in such a small space. “Sure.” is easier to read than “Impossible!” but it doesn’t strike me as something that ought to really disadvantage an author that much. So what I’ll report are buckets of word counts: Flesch Reading Ease scores for medium sentences (4-8 words) and for long ones (9+ words). The infographic also shows the percentage of short sentences (1-3 words) and hard sentences (Flesch Reading Ease less than 50).
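Here is a small sketch of that bucketing, assuming you already have a word count and a Reading Ease score for each sentence. The function names and the toy scores are hypothetical, and I treat "hard" as strictly below the threshold, which is an assumption about the boundary case.

```python
from statistics import mean


def bucket(word_count: int) -> str:
    """Bucket a sentence by length: short (1-3), medium (4-8), long (9+)."""
    if word_count <= 3:
        return "short"
    if word_count <= 8:
        return "medium"
    return "long"


def summarize(scored: list[tuple[int, float]], hard_below: float = 50.0) -> dict:
    """Summarize (word_count, reading_ease) pairs for one text."""
    by_bucket: dict[str, list[float]] = {"short": [], "medium": [], "long": []}
    for word_count, score in scored:
        by_bucket[bucket(word_count)].append(score)
    total = len(scored)
    hard = sum(1 for _, score in scored if score < hard_below)
    return {
        # Mean scores are reported only for medium and long sentences.
        "medium_mean": mean(by_bucket["medium"]) if by_bucket["medium"] else None,
        "long_mean": mean(by_bucket["long"]) if by_bucket["long"] else None,
        "pct_short": 100 * len(by_bucket["short"]) / total if total else 0.0,
        "pct_hard": 100 * hard / total if total else 0.0,
    }


# Toy (word_count, score) pairs, invented purely for illustration.
toy = [(1, -132.6), (2, 100.0), (5, 80.0), (6, 40.0), (12, 60.0), (20, 30.0)]
print(summarize(toy))
```

Leaving short sentences out of the averages keeps a one-word exclamation from dragging down an otherwise readable author, while the percentage of short sentences still shows how often they occur.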

Among the easiest to read authors are the Brothers Grimm, Henry James, Mark Twain, Lewis Carroll, and Sir Arthur Conan Doyle. But there’s some variation there: James’ Wings of the Dove has very easy to read medium-length sentences but its long sentences in volume 1—but not volume 2—conjoin and detach phrases for 100 words after they begin (search for She wore her “handsome” felt hat or With which he had it again all from her here).

The hardest to read are Mary Shelley, Victor Hugo, Jane Austen, Franz Kafka (David Wyllie translation), and George Eliot.

The sentences with the lowest Reading Ease scores generally go to Victor Hugo (Isabel F. Hapgood translation), which has a number of 300+ word sentences. Those are so rough that the Reading Ease scores for them are negative. This isn’t just a French or French-in-translation thing: even Mark Twain has some 200+ word sentences in Huckleberry Finn.

And tweets? Tweets belong somewhere in between. In terms of long sentences (9+ words), tweets are super-easy to read. That’s partly because if you’ve managed to fit 9 words within Twitter’s 140-character limit, odds are that each word is short.


But in terms of 4-8 word sentences, tweets are among the hardest things to read, sandwiched between Kafka and Austen. Keep in mind that #hashtagsareconsideredtoughtoread. This is basically true whether we use tweets-as-they-are or if we only consider tweets that don’t have @’s or links since those tend to be longer strings that sort-of-shouldn’t-count (the subset of tweets that don’t use @ and don’t have links is about 100,000 tweets big).

But let’s look at how often you’ll come across a difficult sentence in these things. I’ll define a sentence as difficult if it has a score of 50 or lower. In that case, we can see that 34% of Frankenstein’s sentences are difficult, mainly because there are some really long run-on sentences. So you’ll encounter tough sentences most frequently in Frankenstein and second in Pride and Prejudice where 31% of the sentences are hard.

If you read through a bunch of tweets (ignoring the ones with @’s and links because those make tweets look especially difficult to the algorithm), then about 11% of the tweets you encounter are difficult. That’s comparable to The Adventures of Sherlock Holmes (11%) and a bit harder than Alice in Wonderland (9%) and Huckleberry Finn (8%). Note that only 14% of the sentences in Henry James’ The Ambassadors are tough-to-read. For what it’s worth, James’ average percentage-of-hardness is 16% across the five works I looked at; George Eliot’s is much higher: 25% of her sentences are hard.

As for the opening lines, of the 837 first sentences I looked at, about 18% of them are tough (for English majors: the opening sentences of authors that Harold Bloom considers canonical are especially long and complex).

Readability isn’t just a measure of how hard/easy something is for cognitive processing, it also tells us something about author and audience. Check out Dan Jurafsky’s work showing how readability distinguishes the language on potato chip packaging (fancy potato chips have fancier sentences).

Final words

Some authors are tough to read. And they have always been tough to read. The process of reading a book or a Twitter stream is a process of familiarization. So difficult books—and Twitter—can actually teach you how to read them as you go along.

Struggling towards understanding is not necessarily a bad thing; it can be beautiful and enriching to find one’s way through a dark wood. But there are times when accessibility is crucial—think about healthcare issues or that complicated email you fretted over last week.

In addition to sentence/word-based metrics for readability, Idibon also can report frequency-based measures since psycholinguists have decades of evidence that more frequent words are easier to process than rare words. Our clients have used readability metrics to automatically categorize comments on websites and also as a feature to predict helpfulness of comments (and other kinds of business-specific needs).

Henry James despised sentences that were a “mere seated mass of information”.  Without syntax, words would just sit there like lumps. It’s syntax that gives them backbone. There’s meaning in the way they come together.

Entrepreneurs and empresarios: trends in English, French, and Spanish

2 Apr


Au Revoir, Entrepreneurs–that’s the title of a recent piece in the New York Times by Liz Alderman, though the gloom about entrepreneurship in France sounds less like the ‘see you later’ of au revoir and more like an adieu ‘for a long long time’.

This blog post is about how ‘entrepreneur’ is used in English, French, and Spanish (entrepreneur for the first two, empresario for Spanish; both words come from Latin ‘to take/seize’, although the Spanish word had a fun knight life in the 14th century as emprise, ‘chivalrous endeavor’).

Words and ideas have different ways of traveling through the world. If you only looked at English social media on entrepreneurship you’d get a fairly rosy picture. But you’d miss the fact that this French word is actually rarely used in French. And you’d miss all the negativity around the term in the Spanish Twittersphere.

Cultural nuances matter—even if you only speak English, you may sense that tycoon, mogul, and industrialist can have a flavor of ‘ill-gotten gains’ that English entrepreneur doesn’t seem to have for most English speakers.

The same basic idea applies to brands: their positions differ from subculture to subculture. What it means to drive a Jeep, wear Gucci, shop at Old Navy, or carry a Saks sack isn’t uniform. In this post we’ll be talking about entrepreneurs but the diversity here applies to brands, products, and features, not just cultural and economic keywords.

The data behind this report were gathered between March 11 and April 1, spam filtered and restricted to authors who consistently tweeted in one of the three languages investigated. In the end, that’s about 100,000 tweets made up of English (71.5%), Spanish (26.0%) and French (2.5%).

French: An absence of ‘entrepreneur’

34% of all of Twitter is in English, 12% in Spanish, and 2% in French.


Image from MIT Technology Review

So if everything were equal, we’d expect English to have 2.8 times as many mentions of ‘entrepreneur’ as Spanish and about 17 times as many mentions as there are in French.

The Spanish count is actually a bit higher than we’d expect (25,586 instead of 24,805). But the French numbers are way off. We’d expect 4,134 tweets about entrepreneurs in French but instead there are only 2,490, roughly 40% fewer than Twitter’s overall language mix would predict.
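If you want to check the expected counts yourself, here is a minimal sketch of the arithmetic in Python. It scales the English count by the ratio of each language’s share of Twitter. The post never states the English count directly, so the 70,281 below is an assumption, back-derived so that the stated expectations (24,805 and 4,134) come out; the names SHARE, OBSERVED, and expected_count are just illustrative.

```python
# Share of all tweets by language, as quoted above (percent).
SHARE = {"en": 34, "es": 12, "fr": 2}

# Observed 'entrepreneur' tweet counts. The Spanish and French figures come
# from the post; the English figure is an ASSUMPTION, back-derived from the
# expected values the post reports.
OBSERVED = {"en": 70_281, "es": 25_586, "fr": 2_490}


def expected_count(lang: str, baseline: str = "en") -> float:
    """Scale the baseline language's count by the ratio of Twitter shares."""
    return OBSERVED[baseline] * SHARE[lang] / SHARE[baseline]


for lang in ("es", "fr"):
    exp = expected_count(lang)
    ratio = OBSERVED[lang] / exp
    print(f"{lang}: expected {exp:,.0f}, observed {OBSERVED[lang]:,}, ratio {ratio:.2f}")
```

Under that assumption, Spanish comes out at a ratio of about 1.03 (slightly more tweets than expected) and French at about 0.60 (far fewer).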

What are the French tweeting about when they do mention entrepreneurs? Like the English tweets, the French tweets deal with advice, success, and investors. There’s more talk about processes, trust, politics, and creation/development in the French tweets than in the English.

They do not talk much about marketing compared to the English tweeps. And there’s very little mention of social, either as a keyword or the names of major social media services/topics/techniques. And while the French tweets talk about investors, they don’t really talk much about founders or startups, which are fairly big entrepreneurship themes in the English tweets. But as you’ll see below, the French authors do talk about these topics, they just don’t talk about them alongside the word ‘entrepreneur’ as the English authors do.

Within the NYTimes article, there’s the idea that failure isn’t tolerated in France. Various ways of talking about failure are absent from the French data. Even rater (‘to miss/fail/mess up’) is really only used to promote a popular article about six Massive Open Online Courses that entrepreneurs shouldn’t miss (Les six #Mooc à ne pas rater pour l’entrepreneur). In the English tweets, failure is mentioned much more often and with a much greater diversity–a lot of them specifically address fear of failure.

Finally, I’ll note that the bad press seems to have stirred up something. Consider the last couple of days: the NYTimes article isn’t circulating around the French tweets right now but the counts of entrepreneur tweets are basically at the same rate as the English ones. It remains to be seen if this activity level is maintained or if it returns to pre-Mar 22nd levels.

Spanish: Lots of negativity

There are about 5 major trends in the Spanish social media posts about entrepreneurs. These themes account for about 70% of the data (a much higher percentage if you allow each retweet and almost-identical-tweet to count separately).

  • Politics
  • Government-entrepreneur interaction (laws, meetings between government representatives and entrepreneurs)
  • Useful information for entrepreneurs (tips, events)
  • Irregular business activities (corruption, fraud)
  • Insecurity (kidnappings, murders)

The dividing line between categories depends a lot on what use case you’re pursuing. For example, do you want a category that is separate from politics that handles news about economic policy and/or critiques of free trade agreements? For more about the importance of solutions that adapt to particular organizational questions, check out our post on what counts as a person in Korean.

Here are some examples from the top three categories (which were about 55% of the data).

  • Politics
    • Un país donde los empresarios quieran venir y no irse, eso es lo que necesitamos. (‘What we need is a country where entrepreneurs want to stay instead of leaving.’)
    • México necesita candidatos comprometidos con su pueblo que escuchen las voces de todos, no solo de la cúpula de su partido y empresarios. (‘Mexico needs candidates committed to the people; candidates who listen to everybody and not only care about their party’s elite and entrepreneurs.’)
  • Government-entrepreneur interaction
    • Presidente paraguayo recibe nueva comitiva de empresarios brasileños (‘The president of Paraguay meets a new delegation of Brazilian entrepreneurs’)
    • Alcalde se reúne con empresarios japoneses (‘Mayor holds a meeting with Japanese entrepreneurs’)
  • Useful information
    • Todo listo para la mayor feria de tecnología para microempresarios … – (‘All set for the biggest technology fair for micro-entrepreneurs’)
    • Lectura imprescindible para autónomos o empresarios: “Se acabó lo de trabajar gratis” vía @TriaNico (‘Must-reading for freelancers or entrepreneurs: “Working for free is over”’)

It’s probably worth noting that an alternative to ‘empresario’ would be ‘emprendedor’, depending on which Spanish-speaking country you’re in. If that’s your word, you might like to have a chuckle at how Real Academia Española defines emprendedor.

English: brands and entrepreneurs

I’ve been sprinkling in trends in the English data throughout the post (more in the last section, which extends the notion of “demographics”). But one of the things that stands out in the English-language data is the proper nouns (“named entities”). In particular:

  • SXSW: emerging technology news from and for entrepreneurs at South By Southwest 
  • Dubai and #KSA: mostly about digital nomads and various tax incentives from the UAE and the Kingdom of Saudi Arabia
  • Twitter: a lot of these are about young entrepreneurs and leadership
  • Google: lawsuits, acquisition rumors, ways of thinking, and Google services that entrepreneurs can use
  • LinkedIn: linked to startups, strategy and marketing, in particular
  • Etsy: focused on smallbiz but I’m tempted to call a fair amount of these spam

Note that I did restrict myself to tweets from authors who are consistent in their language use. But there is plenty of multilingualism. That said, English speakers, especially from America, tend to underestimate language diversity. The chart above may have surprised you. But also consider the chart below, which shows how limited a picture you’d get from social media listening in most countries if you only had access to English.


Image from Mocanu et al. (2013)

For more about language diversity check out our earlier post estimating how much a global organization should be spending on major world languages: English isn’t the only language of the heart.

Beyond simple demographics

Basically any time you’re giving themes, you should also ask about the humans behind the patterns. A typical way of doing this is to report traditional categories like gender, race/ethnicity, location, and income. But carving the world up in this way presupposes that these dimensions are more important and relevant than the kinds of things people actually do and say. The world is a lot more complicated than “Women 30-45” and “Affluent Washington suburbanites”.

Let’s talk about the social roles that these authors use to describe themselves. English-language authors tend to write and think for a living–unlike the Spanish and French tweeps, the English ones identify themselves as authors, speakers, writers, bloggers, and strategists. They are also coaches, which almost always means something like a life coach.

Meanwhile, the people who talk about entrepreneurs in Spanish are journalists, students, lawyers, and engineers (periodistas, estudiantes, abogados, ingenieros). The French-language authors are less likely to identify themselves by job titles, although there is a subset of présidents.

The French authors prefer to mention what they’re up to, for example stratégie and webmarketing. Both the French and English authors talk about being consultants and both groups have clusters of people interested in management, innovation and startups, though only the French authors identify themselves as founders. There’s a sizable number of both English and French authors who are part of groups that help young entrepreneurs/leaders.

English-language authors tend to mention specific family roles–mom, wife, husband, father. The Spanish-language authors also talk about being padres and about the importance of their familias, but there are overall fewer mentions. Family relationships aren’t a prominent part of how the French authors style themselves.

There seems to be more of a wall between public and private for the French authors. For example, while the English authors talk about their love of food and both English and Spanish authors talk about music and about being fans (fan, enthusiast, hincha) of various things, the French authors are much more professionally focused. It is also partly the case that the French interests are more diverse and therefore don’t cluster into sizable groups the way the English and Spanish authors’ interests do.

Politics is a much bigger deal for the Spanish and French authors. In the case of Spanish authors, there are clusters of people who refer to themselves as socialistas and revolucionarios (which if you don’t speak Spanish is still basically what you think it is). Since a fair number of the authors are from Venezuela, they also talk about being Chavistas.

The French approach is less specific: people talk about politique in more general terms (politique varies between meaning ‘policy’, ‘political’ and ‘politics’). Because there are a fair number of people from Québec, there are a few who do talk about being séparatistes.

In more general terms, the English authors talk about women, ideas, success and are often focused on the local in their profiles. French and Spanish authors are more interested in economics and commerce than the English authors. And while everyone mentions businesses, only the French talk about small and medium businesses (PME abbreviates petite et moyenne entreprise). The Spanish authors have a lot of words around media.


It is not enough to translate between languages: you need to understand the cultural context in which a word is spoken to truly analyze its meaning.

Imagine that you wanted to promote something about entrepreneurship to all of these authors. If you only looked at tweets that included keywords like ‘entrepreneurship’ you’d get some sort of sense of the people you’d be appealing to. But you’d miss out on the fact that no one’s tweets about entrepreneurship stand alone. They are tweeting a variety of other things and even have interests in some topics that they rarely tweet about. How do people make meaning in their lives? You can’t just answer by looking at only the small sliver of their lives that you have a keyword for.

