Archive | August, 2013

Naming camps and corporations

(There’s a game at the end of this post.)

Entities in the world go by many names. For example, if you stay at the Westin, the Sheraton, or the W, you may have a Starwood card. So you may even know that they are part of “Starwood Hotels”. But you probably don’t go around talking about the full name “Starwood Hotels and Resorts Worldwide, Inc.” all the time. Even in semi-official documents, the “Inc” might get dropped. Then again, it might get spelled out.

Similarly, you may have a perfectly fine full name that few people address you by. Even your first name may have variants. And sometimes you drive off into the northern Nevada desert and are hailed as Goddess Sparkle Pony or you answer to the appellation GLUTEN.

Wait, what?

Here’s the deal, everyone. This week San Francisco is a bit empty because so many people have gone to Burning Man—a kind of chaotic community/art/party thing. 68,000 people are there. You can check out its governing principles, but my favorite description is “Dadaist temporary autonomous zone”. None of us working at Idibon are in the desert but we salute all the makers and creators, whatever degree of dusty or dapper.

At any rate, rather than just tell you about automatically detecting named entities and disambiguating them, I’m going to compare and contrast Burning Man camp names with the names of organizations in the S&P 1500. And since Seamus Heaney just died, I’m going to focus on alliteration.

For the data, I’m analyzing the 910 Burning Man camps that have publicly registered. Most of these camps have names that are two or three words (median words: 2, median characters: 15).

There are 143 one-word camps (Mystopia, DISORIENT, Homojitoville). That’s 15.7% of all the names.

How about corporations in the S&P 1500? While I can take the Burning Man camp names largely as-is, I decided to clean up the corporation names. For example, I’ve ditched “Corp” and “Inc”. 463 of the corporations have one-word names. These are things like Accenture, AutoZone, Clorox, Comcast, Macy’s, McDonald’s, Safeway, and Starbucks. About 30.9% of corporations in the S&P 1500 have essentially single-word names.
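As a rough illustration of the cleanup and counting described above, here is a minimal Python sketch. The function names (clean_name, summarize) and the suffix list are assumptions made for this example, and the sample names are a handful taken from this post rather than the full data sets.

```python
import re
from statistics import median

# Assumption: only the two suffixes named in the post are stripped.
SUFFIXES = {"corp", "inc"}

def clean_name(name):
    """Drop trailing corporate suffixes such as 'Inc.' or 'Corp'."""
    words = name.replace(",", " ").split()
    while words and words[-1].rstrip(".").lower() in SUFFIXES:
        words.pop()
    return " ".join(words)

def summarize(names):
    """Median word count, median length, and share of one-word names."""
    # Hyphens are treated as spaces, as in the rest of the post.
    word_counts = [len(re.split(r"[\s\-]+", n.strip())) for n in names]
    return {
        "median_words": median(word_counts),
        "median_chars": median(len(n) for n in names),
        "one_word_share": sum(c == 1 for c in word_counts) / len(names),
    }

# A tiny illustrative sample, not the real list of 910 camps.
camps = ["Mystopia", "DISORIENT", "Swing City", "Pretty Pickle"]
print(clean_name("Starwood Hotels and Resorts Worldwide, Inc."))
print(summarize(camps))

# The percentages quoted above follow directly from the counts.
print(round(143 / 910 * 100, 1), round(463 / 1500 * 100, 1))
```

The same summarize function would apply to the cleaned corporation names, which keeps the two lists comparable.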

Since creativity is a huge component of Burning Man, we expect names may be more likely to be rhyming, punning, or alliterative. Let’s look at alliteration: how typical are Barbarella Bootcamp, Santa’s Summer Sleigh, Pretty Pickle and (my favorite), Blood Bath & Beyond?

Let’s start with the two-word camps. There are 335 of these. I’m going to be medium-strict on what counts as alliteration. We’re going to use sounds, so although Chainsawmargaritaguys Camp has two words and both begin with “c”, one is a /ch/ sound and the other is a /k/ sound. Doesn’t count. Meanwhile, Swing City does count since both of those initial letters are pronounced as /s/. If I wanted to be even stricter, I would insist that alliteration only counts if it’s on the stressed syllable. But I’m not going to do that here.

How many of the two-word Burning Man camps alliterate? 43, which is 12.8% of all the two-word camps. Doesn’t seem that high but it’s twice the rate of the S&P 1500. Of the 668 two-word corporations, only 37 alliterate (5.54%). These are things like Aqua America, Best Buy, and Coca-Cola—I treated hyphens as spaces throughout. By the way, Aqua America is utilities not water parks. Disappointing.

There are 231 three-word Burning Man Camps. For alliteration here, I’m flexible and count first-and-third-but-not-necessarily-second. So not just Porta-Potty Pigs and Karma Collectors Camp, but things like Broken Angel Bathhouse and Burners Without Borders, too. 27% of three-word camps involve some alliteration.
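The rules described so far (compare initial sounds rather than letters, treat hyphens as spaces, and for three-word names compare the first and third words) can be sketched in Python. This is not the code behind the numbers in this post: the spelling-to-sound rules are a crude approximation, the override table is hypothetical, and dropping tokens with no letters (such as “&”) is an assumption about tokenization.

```python
import re

# Hypothetical overrides for words whose spelling misleads the rules below.
SOUND_OVERRIDES = {"celtic": "k", "chaos": "k"}

def initial_sound(word):
    """Return a rough label for a word's first sound ('' if it has no letters)."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return ""
    if letters in SOUND_OVERRIDES:
        return SOUND_OVERRIDES[letters]
    if letters.startswith(("ch", "sh", "th")):
        return letters[:2]
    if letters.startswith("ph"):
        return "f"
    if letters.startswith("kn"):
        return "n"
    if letters.startswith("wr"):
        return "r"
    if letters[0] == "c":
        # 'c' is usually /s/ before e, i, y and /k/ elsewhere.
        return "s" if letters[1:2] in ("e", "i", "y") else "k"
    if letters[0] == "q":
        return "k"
    return letters[0]

def alliterates(name):
    """True if a two- or three-word name starts and ends on the same sound."""
    # Hyphens count as spaces; tokens without letters (e.g. '&') are dropped.
    words = [w for w in re.split(r"[\s\-]+", name.strip()) if initial_sound(w)]
    if len(words) not in (2, 3):
        return False
    return initial_sound(words[0]) == initial_sound(words[-1])

for name in ["Swing City", "Chainsawmargaritaguys Camp",
             "Broken Angel Bathhouse", "Coca-Cola"]:
    print(name, alliterates(name))
```

A pronouncing dictionary would be more reliable than spelling rules, but the sketch is enough to separate Swing City from Chainsawmargaritaguys Camp.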

The S&P 1500 have 293 three-word corporations. The corporations—or Standard and Poor’s postmodern poets—don’t like alliteration all that much. Only 12.6% of the three-word corporations have alliterative names. So congratulatory kudos to Martin Marietta Materials, Fidelity National Financial, Johnson & Johnson, and Wolverine World Wide (although I am not comfortable with “Wide”).

I’ll wrap up with a quick overview of the sounds that Burners like best. They’ve already been coming through. There are 28 names using the /k/ sound and 23 using the /b/ sound. You’ve seen some already, here are some more:

Camp Canadianderthal
Cartoon Commune
Celtic Chaos
First Kiss Café
Conscious Monkey Clan
Camp Kegel Kommandos
Botanica Bodhi Manman nan Bejeezus
Barechested Baristas
Bubbles and Bass

The /k/ and /b/ sounds get a boost from the fact that Burning Man occurs at Black Rock and the groups are thought of as camps.

For corporations, /s/ is the winner (but only 11), followed by /k/ sounds (10 of them) and a bit more distantly by /b/ (7).

Sovran Self Storage
Spartan Stores
Kansas City Southern
State Street
Calgon Carbon
Cooper Companies
Bed Bath & Beyond
Boston Beer

A name game

I’ll send you off with a game. “SmallCap corporation or Playa playground?” I feel like I have neglected vowels. So I’ll just stick to “a”s.

Aaon
Abaxis
Aegion
Arctic Cat
Almost Family
Aar
Allete
Amsurg
Arbitron
Arqule
Atmi
Actuant
Avista
Aerovironment
Anixter
Azz

Happy Friday!

– Tyler Schnoebelen (@TSchnoebelen)

The funny thing about repetition

10 Aug

Bursts of emotion are one of the hardest aspects of speech to capture in written text. It is difficult to capture exuberance and immediacy when you are allowing your readers to read your utterances at their own pace. But these sudden explosions of sentiment are also often the most interesting to track and analyze, giving us insight into a writer’s emotional expressions. Repetition is one of the simplest ways that people overcome these limitations on writing, especially when it comes to expressing laughter.

How many kinds of laughter are there? Some laughter creates cooperative, positive relationships, like friends or mad scientists. Laughter can also be divisive, like when it’s at someone’s expense. You’ve heard laughter that signaled anger, anxiety, hostility. And you’ve laughed to release tension. Laughter can be part of self-deprecation, appeasement and submission. And of course it’s often part of showing sexual interest in someone. So with all these great possibilities, it’s kind of lousy that you can’t text someone a condescending chortle or a genuine belly laugh. But ha! you can.

Using a sample of 9,212,118 tweets from Twitter users, we see that 749 people use “repeated laughter” like ha ha ha or hee hee hee. The vast majority of people are loyal to a single form of laughter: 682 of the 749 use only a single form. That is, there are about a dozen ways of laughing where the laughter is repeated 3 or more times, but individual users tend to choose a single form and stick with it.

In terms of the exceptions, there are 67 variety-is-the-spice-of-laughter people in the sample. In general, these are folks that are using ha ha ha (the most popular kind overall) with one other form. Of the 60 people using ha ha ha and another laughter form, the most popular is he he he (30), followed by hee hee hee (18) and heh heh heh (10). For the handful of people that do multiple things but don’t use ha ha ha, the favorite pairing is hee hee hee with heh heh heh.

Let’s zoom out to look at all kinds of repeats where someone tweets the same word or emoticon three times in a row or more. Once we expand beyond laughter, we see that 8,072 different users repeated words three or more times in a row—in other words about 56% of users do this.
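For readers who want to try this on their own tweets, here is a minimal Python sketch of finding a token repeated three or more times in a row and of collecting the laughter forms each user relies on. The function names, the punctuation stripping, and the LAUGHS set (only the four forms named in this post, not the full dozen) are assumptions for illustration, and the sample tweets are made up.

```python
from collections import defaultdict
from itertools import groupby

# Assumption: only the laughter forms named in the post.
LAUGHS = {"ha", "he", "hee", "heh"}

def repeated_runs(text, min_repeats=3):
    """Return (token, count) for each token repeated min_repeats+ times in a row."""
    # Strip common punctuation so 'ha ha ha!' still counts as three 'ha's.
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    runs = []
    for token, group in groupby(tokens):
        count = len(list(group))
        # Tokens that were pure punctuation (e.g. '. . .') become '' and are skipped.
        if token and count >= min_repeats:
            runs.append((token, count))
    return runs

def laughter_forms_by_user(tweets):
    """Map each user to the set of repeated-laughter forms they use."""
    forms = defaultdict(set)
    for user, text in tweets:
        for token, _count in repeated_runs(text):
            if token in LAUGHS:
                forms[user].add(token)
    return dict(forms)

# Made-up tweets for illustration only.
tweets = [
    ("user_a", "Ha ha ha! I have to work work work work"),
    ("user_a", "hee hee hee"),
    ("user_b", "ha ha ha ha"),
    ("user_c", "wait . . . what"),
]
forms = laughter_forms_by_user(tweets)
loyal = [user for user, used in forms.items() if len(used) == 1]
print(forms, loyal)
```

Because empty tokens are skipped, “. . .” does not register as a repeat in this sketch.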

Most of the words that get repeated three+ times are monosyllabic words (wait wait wait, hate hate hate, work work work). The next most frequent category is sounds (na na na, ding ding ding, da da da), followed by multisyllable words/squished phrases (really really really, very very very, #teamfollowback #teamfollowback #teamfollowback). There aren’t that many repeated emoticons (sad faces, smiley faces…there was also one user who loves ϟ ϟ ϟ so much that she used them in hundreds of her tweets).

There are also 671 users who use “. . .”. People who use these seem to be doing basically the same thing as ellipses (…), just with extra spaces. It doesn’t seem to be the same thing as saying work work work or 🙂 🙂 🙂, so I haven’t included them in the chart above.

Most of these users have just a couple of different things they repeat. The median number of different words that tweeps repeat is 2; the average is 2.8. Besides laughter, my favorite semi-variants are yum yum yum and nom nom nom (8 users do both, 353 use just nom nom nom, 51 people use just yum yum yum). There was a vote to make nom nom nom Word of the Year for the American Dialect Society a few years ago and voting against it is one of the greatest regrets of my career in linguistics. As Maryam Bakht said, a vote for nom nom nom is a vote for joy.

– Tyler Schnoebelen (@TSchnoebelen)

The dirty hands of data scientists

2 Aug

Welcome to this post! Now go read Harlan Harris, Sean Patrick Murphy, and Marck Vaisman’s 40-page book, Analyzing the Analyzers.

The goal of Analyzing the Analyzers is to reduce miscommunication about what is meant by “data scientist”. Their results come from 250 surveys, in which they asked data scientists about their backgrounds, their tools, and how they think of themselves. They come up with four types of data scientist:

data businessperson: primarily a leader, businessperson, and/or entrepreneur

data creative: a jack of all trades, artist, and/or hacker

data developer: a developer and/or engineer

data researcher: a researcher, scientist, and/or statistician

In terms of skills, Harris and co. asked about business (e.g., product development, budgeting), machine learning/big data (e.g., NoSQL, text mining, JSON, SVMs, clustering, Hadoop), math/operations research (e.g., optimization, graphical models, Bayesian/Monte-Carlo stats, CS theory), programming (e.g., sysadmin, Java, C++), and statistics (e.g., visualization, time-series analysis, surveys, GIS, R).

Generally, I think of mosaic plots as plots-against-humanity, but theirs is clear and useful. In particular, I like how it helps you reflect on what kind of data scientist you are (or want to become). Their framework also makes it possible to assess whether you are in the right organization. One of my favorite definitions of integrity:

Integrity: integrating who you want to be with what you do

Does your organization support the kind of data scientist you want to be? If not, can you get it to shift or do you need to look elsewhere? These are hard decisions but market forces are on the data scientist’s side. 40+ hours is a lot of time to spend on anything that doesn’t make you who you want to be.

On that note, three of our favorite data scientists announced they are starting new jobs this week: Hilary Mason is joining Accel Partners, DJ Patil is joining RelateIQ, and Monica Rogati is joining Jawbone. We wish them the best of integrity!

One of the personas (okay, personae) that Harris et al. develop is Binita, a Director of Analytics. She’s the representative of the data businessperson. It’s obvious how deep in the data the other three types of data scientists are. So what stands out for me is that even the data businessperson “really likes getting her hands dirty, diving into data sets when she has time”.

This is, I think, a defining characteristic of a data scientist. Btw, maybe we all are increasingly likely to get our hands dirty:

Figure: “Get x’s hands dirty” from 1900-2000 in the Google NGrams Corpus

But maybe we’re not: see our post on going beyond raw word counts to track trends.

So what exactly does it mean to get one’s hands dirty? There are at least two models here: Gardener or Sartre. In the Sartre-sense, getting dirty hands is doing morally iffy stuff in order to ensure morally proper goals can be achieved later on. (You can also think of this in terms of Machiavelli, Max Weber, and Michael Walzer.) One big issue in data science has always been—and always will be—ethics. Where do the data come from? How are the analyses used? Addressing the ethics of data science is a Big Other Post.

When most data scientists talk about enjoying getting their hands dirty, we are not, of course, saying, “I just love it when I get to achieve positive results or avoid disasters by violating the deepest constraints of morality.” We’re usually imagining something like gardening. Or if that sounds too retired, the image might be a kid happily wallowing in the mud. These images capture, alternatively, care and curiosity, planning and playing. These are also ways of being a data scientist.

But let’s connect gardening and Sartre. The data we analyze are social in their nature. This may be obvious if you’re analyzing Twitter or Facebook, but it’s also true if you’re monitoring particles or planets. My own focus is on data created by people (language data) but even measurements of tiny and astronomical things were conceived of and implemented by people trying to do something. And the analyses are also part of a social system.

This is one of the reasons I think social scientists are a crucial part of data science teams. Understanding the data requires understanding where it comes from and how it is getting used. (Check out Steve Miller’s post on computational social scientists.)

But having someone who can get a handle on the kind of meaning that the data points were imbued with when they were created has to be balanced with some real skills. The weeding of hypotheses. As a reference point, consider whether women wear red or pink shirts more at peak fertility.

Be skeptical of the findings from Beall and Tracy (2013). This figure is from that work and shows the percentage of women at high conception risk in two different samples.

You can read Andrew Gelman’s critique of Beall and Tracy (2013) in Slate, but don’t miss their response and his-response-to-their-response in his blog. The fundamental critique is one of too many “researcher degrees of freedom”.

The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that the observed pattern in the data would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.
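The arithmetic behind that point can be made concrete with a short Python sketch. It assumes the comparisons are independent and that nothing is really going on in any of them, which is a simplification; the function name is invented for this example.

```python
def chance_of_false_positive(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent null comparisons
    falls below the significance threshold alpha purely by chance."""
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

Under those assumptions, 20 comparisons give roughly a 64% chance of at least one “significant” result, which is why a single hit among many implicit comparisons is weak evidence.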

Skepticism is a crucial tool in Gelman’s toolkit. That comes not just from asking statistical questions but asking questions about the questions we’re asking. And that isn’t just about analyzing the wording of survey items. It is also part of thinking through why we’re asking the questions we’re asking. The research question (do women give off visual cues to ovulation?) has a lot to do with various ideologies of gender. Consider what it means. It means saying that underlying my sisters’ choice in clothing is a desire (culturally and/or evolutionarily driven) to be attractive to men. Maybe they just want to look nice for themselves? Maybe they just want to be comfortable and their fuzziest clothing happens to be red?

I pick up the phone. One of my sisters is wearing grey today. “I don’t know where I am in my cycle.” My other sister is ready to have kids. “I don’t want to think about it. Yes. Maybe. I think so. I’m leaving for Brazil tomorrow morning so I don’t want to know.” She’s in a white t-shirt. “I was wearing an orange-ish sweater but it got hot so I took it off. Does that count?”

Whether we are in academic institutions or enterprises, we like finding patterns in vast, complicated tangles of data. But does what we count, count? Ultimately, we want our analyses and our actions to add up, not to something statistically significant, but to something meaningful.

– Tyler Schnoebelen (@TSchnoebelen)

Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx

Chat room corpus

Went hunting around for some chat room corpora today. I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]

African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]

COCA: What a fantastic source of data!

Intro: 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]

What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
