Archive | April, 2013

Are bball and hoops synonyms?

For language processing tasks, it’s easy to imagine why you would want to conflate synonyms. For example, if you wanted to classify some news article as being ‘sports related’, you would likely be more accurate if you know that shooting hoops and playing bball mean more or less the same thing.

However, there are few pure synonyms in a language and the differences can matter. Whether or not you care about the differences depends greatly on the context. How do you invite your friends to play vs. watch a game? Does the area you live in have a particular preference? This is a post about the information to be gained by looking at variations that, at first glance, seem to mean the same thing. What would we lose by lumping them all together?

As part of text analysis of social media streams, we were parsing sports terms from about 10 million tweets. In the stream, there were 3,740 individuals who used at least one of the terms basketball, bball, hoops. In terms of mentions, there were 10,944 occurrences of one of the three terms, but basketball was by far the most popular (78.6% compared to 12.1% for bball and 9.35% for hoops).

As a side note, different senses of words complicate the matter. For example, we filtered around 250 non-basketball-related hoops. This was a big chunk: about one-fifth of all hoops mentions. As you might guess, most of the non-basketball occurrences were about ‘jumping thru’ and ‘hula’ hoops. This is important for synonym conflation: you can imagine errors unnecessarily propagating if the unambiguous ‘jumping through hoops’ was conflated with basketball terms. You need NLP systems that are sensitive to these differences, or better yet, can discover them automatically. Sad news, btw, no one reported any bureaucracies making them jump through hula hoops.

Figure 1: Different co-occurrence patterns (except for “watch*”)

In Figure 1, you can see the ratio of observed occurrences to what would be expected if things were at chance. There are 686 tweets with at least one basketball term and lol. Since 12.1% of all the basketball tweets have bball in them, if everything were random, we’d expect 686*12.1%=83 tweets with both lol and bball. Instead we observe 135, which is about 1.6 times what we’d expect by chance (roughly 63% more).
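For anyone who wants to redo the arithmetic, here is a minimal sketch of the observed/expected calculation. The three input numbers come from the paragraph above; the function and variable names are my own invention for illustration, not anything from the original analysis.

```python
def expected_count(tweets_with_word: int, term_share: float) -> float:
    """Tweets we'd expect to contain both the word and the term if usage were random."""
    return tweets_with_word * term_share


def observed_over_expected(observed: int, tweets_with_word: int, term_share: float) -> float:
    """Ratio plotted in the figure: 1.0 means 'exactly at chance'."""
    return observed / expected_count(tweets_with_word, term_share)


# Numbers quoted in the post for "lol" and bball.
lol_tweets = 686          # tweets with a basketball term and "lol"
bball_share = 0.121       # share of basketball tweets that use bball
observed_lol_bball = 135  # tweets with both "lol" and bball

expected = expected_count(lol_tweets, bball_share)
ratio = observed_over_expected(observed_lol_bball, lol_tweets, bball_share)
print(f"expected={expected:.0f}, observed/expected={ratio:.2f}")
```

The same two functions work for any word/term pair: swap in the tweet count for the word and the share for the term.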

The further away from 1.0 that an observed/expected value is, the more something is going on. For example, people add links to tweets when they are talking about hoops a lot more than when they are talking about bball. And as a friend of mine guessed, hoops is most commonly used about college hoops. There are only half of the occurrences we’d expect for nba+bball. Meanwhile, watch/watches/watching/watched show an example of something that people are equally likely to use with all three terms.

Some other findings:

People seem to talk more about what they are doing/feeling with basketball and bball than with hoops.
Bball goes especially with getting/going/coming “out” physically or checking something “out” virtually. (At the time of the dataset, this was just a motion-type of coming out, not a metaphorical closet coming out. Your turn: which basketball terms are used most with the Jason Collins story?)
If we pull the target words out of tweets and calculate their lengths we see that basketball tweets are shorter (84.9 characters on average compared to 96.9 for hoops and 96.2 for bball). Note that people who talk about the #knicks only had 72.8 characters on average—you can imagine that this has to do with fast typing thumbs speeding along during a game and not being all that wordy. (There are other stories available if you dislike the #knicks, but I’m not telling them today.)
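The length comparison in the last finding can be sketched roughly like this. It assumes that "pulling the target words out" means deleting the three terms and collapsing the leftover whitespace; that detail, the helper names, and the two sample tweets are all made up for illustration.

```python
import re
from statistics import mean

# The three target terms from the post, matched as whole words.
TARGET = re.compile(r"\b(basketball|bball|hoops)\b", re.IGNORECASE)


def length_without_targets(tweet: str) -> int:
    """Character length of a tweet after removing target terms (assumed method)."""
    stripped = TARGET.sub("", tweet)
    # Collapse the gap left behind so removal doesn't leave double spaces.
    return len(" ".join(stripped.split()))


def mean_length(tweets) -> float:
    """Average remaining length across a group of tweets."""
    return mean(length_without_targets(t) for t in tweets)


# Toy tweets, not from the dataset.
sample = ["Playing bball later, who is in?", "College hoops tonight"]
print(mean_length(sample))
```

Removing the terms first matters because basketball is longer than bball or hoops, so raw tweet lengths would be skewed by the very word you are comparing.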

In addition to lumping words, we might wonder about lumping people. Rather than tell you how, say, men and women are using these terms, let’s instead cluster people based on words they use in common (for more about gender, social theory and computational methods, see this paper).

People who use a lot of African American Vernacular English terms across all of their tweets (e.g., finna) tend to like bball and dislike hoops. But whoa-there-on-monolithic-statements, there’s more than one AAVE-heavy cluster and the different clusters use the terms at different rates. What’s more, a cluster of tech people who especially talk about api’s, ios, ui’s, portal’s, plugin’s, and developers also like bball and dislike hoops. And people who talk about startups and brewing like bball (but lack the antipathy for hoops; they use it at chance).

People who talk about #socialmedia, linkedin, #photo, seo, webinar’s, infographic’s, and klout really like to use hoops and really avoid bball. The same pattern holds for a separate cluster of users who talk about #americanidol, hipsters, #oscars, and #goldenglobes. If you were talking to either of these groups, go for hoops.

To get a sense of how much variation there is in our three terms, let’s see what happens when we limit ourselves to people who use at least one of the terms in five or more tweets. There are 586 such people.

193 use only basketball
4 use only hoops
2 use only bball
167 use basketball and bball
135 use basketball and hoops
85 people use all three

In other words, these terms are in some sort of variation for about 66% of people.
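Here is where the 66% comes from, as a small sketch. The counts are the ones listed above; representing each group as a set of terms is just my choice for illustration.

```python
# People (with five or more basketball-term tweets) grouped by which terms they use.
usage_counts = {
    frozenset({"basketball"}): 193,
    frozenset({"hoops"}): 4,
    frozenset({"bball"}): 2,
    frozenset({"basketball", "bball"}): 167,
    frozenset({"basketball", "hoops"}): 135,
    frozenset({"basketball", "bball", "hoops"}): 85,
}

total = sum(usage_counts.values())
# "In variation" = the person uses more than one of the three terms.
varying = sum(n for terms, n in usage_counts.items() if len(terms) > 1)
share = varying / total

print(f"{varying} of {total} people vary ({share:.0%})")
```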

Recognizing that different people talk differently (and that the same people talk differently at different times), one of the key tasks for natural language understanding and text analysis is to find insights in the differences.

– Tyler Schnoebelen (@TSchnoebelen)

5 facts for a powwow

25 Apr

Today through Saturday is the Gathering of Nations Powwow in New Mexico, so we thought it’d be a good time to highlight some of our favorite things about the indigenous languages of the US and Canada.

There are important stories to be told about these languages. Their disappearance—erasing—is a personal, cultural, historical, and scientific loss. But today we’d like to focus on celebrating their vitality. We’ll ease you in with some English borrowings, then show you a handful of fun things that languages do. There are all sorts of reasons that language revitalization is a good idea. Today we focus on just one of them: joy. Please share your own favorites in the comments.

1. Powwows come from Rhode Island

The word powwow comes from the Narragansett word for shaman (closer to powwaw, originally). When we use it in English, it means a conference or gathering of people. That’s because if there was an important Narragansett gathering going on, a healer/holy person would have been present. So English speakers took the name for an important role as the name for the events where he showed up.

As with a lot of native languages, there aren’t any native speakers of Narragansett left, though the tribe is trying to revitalize it based on earlier records of the language. One small record of the language is in the words you and I speak.

In addition to powwow, Narragansett also gives us squash (the edible kind, not the verb—from askútasquash; the verb kind of squashing was around during Middle English when it was squachen from the Old French esquasser from the Latin quassāre, ‘to shatter’). It’s where papoose comes from (papoòs, ‘child’). And fans of Family Guy should know that Narragansett gave us the name of a certain edible clam: quahog (from poquaûhock).

2. English didn’t just borrow powwow and Massachusetts

Native American languages also gave us: moose, raccoon, skunk; caucus, kayak, pecan. And bayou. And husky.

Moose is from mos in Eastern Abenaki (Maine, Quebec)
Raccoon is from the Virginia Algonquian (Powhatan) arathkone
Skunk is from the Massachusett squnck (one guess where Massachusetts comes from)
Caucus is from the Algonquian caucausu (‘counselor’)—but this is contested. The first use in English is about Boston’s Caucas Club, which could be from the Algonquian or from the medieval Latin for ‘drinking vessel’ (both are relevant, as you can guess).
Kayak is from Inuit and Yupik qajaq
Pecan is from French (pacane), but it got into French from the Illinois pakani
Bayou also comes to English from the French, but it got there (maybe) from Choctaw bayuk

Husky is a corruption of Eskimo. But to be totally honest, this is really just the origin of the dog breed. The ‘heavily built person’ sense probably comes from husk (which itself comes from ‘little house’ and ‘apple core’ in Middle Dutch: hūskijin). Here’s an extra one to make up for that: Winona is from Lakhota Winúŋnã, which means ‘firstborn daughter’.

3. Not everyone needs six syllables to do something unintentionally

Ahtna, Koyukon, Tanana, and Carrier are languages spoken in Alaska, the Yukon, and in British Columbia (the language family is called “Athabaskan” and also includes Navajo). These languages have what’s been called an “errative” marker. As in ‘err’ or ‘oops’. Basically, you add it to a verb to mean “this verb-ing was unintentional”.

In Koyukon the form is –naa-, so you can laugh-naa or buy-naa new shoes. (I know it looks like you’d say naa like /ah/ but the spelling in Koyukon means that the vowel is more like the vowel in English hat.)

4. Havasupai speakers don’t get “him” and “him” confused.

Down near the Grand Canyon, Havasupai speakers are able to be rather specific with their pronouns:

yaj: he (right by me)
vaj: he (near me)
nyuj: he (near you)
thaj: he (over there)
waj: he (far away, out of sight)
vuj: he (that was here but is gone now)

(Look at that last one again; it’s my favorite.)

5. Cherokee verbs are awesome

People often talk about Cherokee’s great script and you owe it to yourself to go check it out. Is there a better way to write ‘ya’ than Ꮿ?

But what I want to talk about instead is that for a lot of Cherokee verbs you are required to have a marker that matches the thing that the verb is affecting. So if you ‘give’ something, it matters whether you’re giving something living, liquid, bendy, long, or compact. How fun is that?

(3) Wèésa gá-káà-èè’a

cat 3rdperson.to.3rdperson-LIVING-give.present

‘She is giving him a cat.’

(4) Àma gá-nèè-èè’a

water 3rdperson.to.3rdperson-LIQUID-give.present

‘She is giving him water.’

(5) Àhnàwo gá-núú-èè’a

shirt 3rdperson.to.3rdperson-FLEXIBLE-give.present

‘She is giving him a shirt.’

(6) Gànsda aá-d-èè’a

stick 3rdperson.to.3rdperson-LONG-give.present

‘She is giving him a stick.’

(7) Kwàna aá-h-èè’a

peach 3rdperson.to.3rdperson-COMPACT-give.present

‘She is giving him a peach.’

If you don’t know which marker to put in there, you use the ‘compact’ one. (I should probably also mention that there are some changes to the first prefix; please let me know if you know why.)

These examples come from Mary Haas by way of Barbara Blankenship. But I want to be sure to also credit Cherokee speakers Virginia Carey and Levy Carey who helped Blankenship in 1997 and the speakers Haas worked with back in 1948, whose names seem to be lost to us (which was all too often the case, though my impression is that speakers get acknowledged fairly consistently these days).

Special bonus!

I’m including one more sentence just because it makes me happy. It’s from Eastern Ojibwa, which is spoken in Southern Ontario (about 26,000 speakers by a 1998 census). Eastern Ojibwa is part of a larger group of languages that are often called Ojibwa but also sometimes known as Anishinaabemowin. (Btw, English chipmunk may come from Ojibwa ajidamoonʔ.)

(8) “mii maanpii wii-bkeyaanh” kido giiwenh wa mko

but here intend-turn.off.I singular.say.he/she singular.it.is.said that bear

“Well, this is where I turn off,” the bear said.

– Tyler Schnoebelen (@TSchnoebelen)

We’ve lost that lovin’ feelin’

22 Apr

It’s time to send out a cultural SOS to Glee, Taylor Swift, Lil Wayne, and Justin Bieber: love needs your help.

One of the building blocks of language technology is named entity recognition: identifying the proper nouns and names of real-world items. For people and locations, it’s pretty easy. For organizations and products it gets a bit more ambiguous, since organizations and products often take their names from everyday items (e.g., apple). But the hardest entities to identify are titles.

One reason we need to identify titles is that they often act like single words. Take the sentence Total Eclipse of the Heart’s music video made no sense to me when I first saw it. The possessive ‘s actually belongs to the whole title, not just Heart. Similarly, it refers to the whole title, plus music video, not to a specific word like Eclipse. So if you want to understand what’s happening in texts that have titles, you have to understand how titles work.

It was when we started looking to automatically identify song titles, however, that something else found us:

Love.

From 1890 to the end of 2012, there have been 39,044 songs that have hit the top of Billboard’s charts (thanks, Whitburn Project!). 3,583 of these songs have love-words (love, loves, loved, lover, lovin’, luv, etc.). But for the last couple of years, the percentage of hits with love in the title has been only 30% of what it was in 1980, when people knew how to love.
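If you want to try this kind of count yourself, here is a minimal sketch. The regular expression is my guess at a reasonable love-word matcher, not the pattern actually used for the numbers above, and the rows are a toy stand-in for the Whitburn data.

```python
import re
from collections import defaultdict

# Assumed pattern: catches love, loves, loved, lover, lovin', Dreamlover, luv, luvin'.
# It also catches false positives such as "clover" or "glove", so real use needs review.
LOVE_WORD = re.compile(r"lov(e|in)|luv", re.IGNORECASE)


def has_love_word(title: str) -> bool:
    """True if the title contains a love-word under the assumed pattern."""
    return LOVE_WORD.search(title) is not None


def love_share_by_year(rows):
    """Map each year to the fraction of its titles that contain a love-word."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, title in rows:
        totals[year] += 1
        if has_love_word(title):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in totals}


# Toy rows for illustration only; years and selection are not the real chart data.
rows = [
    (1980, "Woman in Love"),
    (1980, "Call Me"),
    (1993, "Dreamlover"),
    (1993, "That's the Way Love Goes"),
    (2006, "I'm N Luv (Wit a Stripper)"),
    (2006, "A Title With None Of Those Words"),
]
shares = love_share_by_year(rows)
print(shares)
```

Matching without a word boundary at the start is deliberate, since titles like Dreamlover should count; the price is the false positives noted in the comment.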

[Graph: percentage of hit songs with love-words in the title, by year]

1980 really was a standout year for love: 15.8% of all songs had love (or some variation) in their titles. Some of the tops:

Queen’s Crazy Little Thing Called Love as well as Need Your Loving Tonight
Barbra Streisand’s Woman in Love
Air Supply’s Lost in Love and All Out of Love (the first one peaked in May, the second one in September)
Spinners had the medley Cupid/I’ve Loved You for a Long Time
And my chicken-eating hero, Kenny Rogers, had Don’t Fall in Love with a Dreamer and Love the World Away (Attention linguists and James Bond super-villains: why wipe away tears when you can love away the world?)

The mid-1950s wasn’t such a bad time, either:

Pat Boone was crooning Love Letters in the Sand and April Love
The Four Aces declared Love Is A Many Splendored Thing (when was the last time you used the word splendored?). They also had Melody of Love and A Woman in Love
Tab Hunter was after Young Love
Joan Weber demanded Let Me Go, Lover!

There were also a lot of Love me’s running around in the mid/late 50s: Love Me Forever, Don’t Ever Love Me, (My Baby Don’t Love Me) No More, Doesn’t Anybody Love Me?, Love Me Or Leave Me, Love Me, Love Me to Pieces.

I have a special place in my heart for a couple titles from the 1970s that really went all out on love—for example, My Baby Loves Lovin’ and Lovey Dovey Kinda Lovin.

Our most recent peak was 1993, when 13.8% of the songs were lovey dovey. The top ones:

Mariah Carey’s Dreamlover
Janet Jackson’s That’s the Way Love Goes
UB40’s Can’t Help Falling in Love
Meat Loaf was serious that I’d Do Anything for Love (But I Won’t Do That)
Vanessa Williams and Brian McKnight dueted a definition of what Love Is

By contrast, 2005 was the loneliest year of recent memory—only 3.2% of songs. 1999 was about the same (3.5%). 2007 was closer to 3.7%. Some examples:

Jennifer Lopez’s If You Had My Love (1999)
Mario’s Let Me Love You (2005)
The Game’s Hate It Or Love It (2005)
Lil Jon & The East Side Boyz’ Lovers and Friends (2005)
Ludacris’ Runaway Love (2007)

The Game’s song might make you wonder about hate. Hate-related words only make it into 30 song titles across 1890-2012 (eight of which have love in them, too). The 2002-2006 time period had the highest number of hate songs: 11 in total. So a lot of hate and not a lot of love. Dark period, folks. We can say to Toby Keith, Kenny Chesney, Tim McGraw, 50 Cent, Nelly, R. Kelly, Rascal Flatts, Eminem: This was your time at the top of the charts. What was your issue with love, yo? (Maroon 5, Beyoncé, Phil Vassar, Avant, Justin Timberlake, Ludacris: you kinda tried, so thanks.)

The fun spelling luv doesn’t hit the top of the charts til Joe’s I’m in Luv in 1993. It hit its stride in 2002 (What’s Luv? asked Fat Joe). You may have cause to recall I’m N Luv (Wit a Stripper) in 2006 (T-Pain) or last year’s Give Me All Your Luvin’ (Madonna. Wait, what? Ah, featuring Nicki Minaj, okay). Just so you know, it is songwriters T. Nash and C.A. Stewart that especially love luv. [Update: see comments about disambiguation, but you may know these guys as The-Dream and Tricky Stewart, respectively.]

2012 was better than 2011 (5.8% vs. 4.2%), but I have to believe we can do better. Here are some of the recent hits:

Rihanna’s We Found Love…and then Glee’s version (41 weeks and 1 week, respectively)
Enrique Iglesias’ Tonight (I’m Lovin’ You) (Why do people try to trap love in parentheses?)
Lil Wayne’s How to Love
George Strait’s Love’s Gonna Make It Alright
Whitney’s (well, it was Dolly’s first) I Will Always Love You (3 weeks for Whitney in 2012, 1 week for the Glee Cast in 2012; 26 weeks for Whitney in 1992; 14 weeks for Dolly Parton in 1982)

Speaking of which, 9 of Whitney Houston’s 40 Billboard hits mention love. Other lovers as big or bigger than Whitney include:

Frank Sinatra (21 of 159 songs have love*)
Paul Anka (14 of 53)
Jackie Wilson (13 of 54)
Leo Reisman & His Orchestra (12 of 71)
The Supremes (11 of 32)
Jo Stafford (11 of 72)
Spinners (10 of 72)
Donna Summer (10 of 32)
Diana Ross (10 of 40)
Stevie Wonder (9 of 55)

Faith Hill has the love song that stayed up top for the longest: The Way You Love Me for 56 weeks (2001). Love Story, the 2009 Taylor Swift song, stayed on the charts for 49 weeks. Only that one and The Way I Loved You (from 2008) have love in their titles, out of her 59 hits altogether. Just in case you wondered, Luther Vandross’ number is 7 of 52; Marvin Gaye is 7 of 52. Point to take home: you can probably be a big lover without ever saying love.

Named entity extraction is one of the things that people ask our systems to do—that is, “find me all the proper nouns”. We are very happy to help you with this kind of information extraction. But we will probably decline requests for love extraction.

Mini-appendix

When people produce word clouds, they typically remove common words—thinking they don’t really give much information. Here’s what it looks like when we put in all the words—love is still a very big deal. There may be some other things going on. In normal speech, the is usually about 2.3 times more frequent than a but here it’s only 1.9 times bigger: 438 vs 226. That may not be a big deal, though one can imagine that definiteness and givenness are doing interesting things—that is, the first time you mention something, it’s likely new and preceded by an a. After that, it becomes definite/given. (“I saw a mermaid”. “What was the mermaid wearing?”)

At least in terms of the spoken part of COCA—which is largely from TV interviews—I and you are about equally frequent. In song titles, as you can imagine, you is more frequent (1.3 times more frequent than I—438 vs. 334 instances). More to the point, it’s exceptionally rare that in normal speech you is as frequent as the (the is about 2.7 times more frequent in the spoken portion of COCA). But here they are the same. You may be able to make hay with these facts. Or not. But small words aren’t meaningless noise, which is essentially what you’re saying they are when you put them in stop lists.
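Counting every word, small ones included, takes only a few lines. This is a minimal sketch with a handful of titles mentioned in this post standing in for the full list; the tokenizer is a simplifying assumption (lowercase, letters and apostrophes only).

```python
import re
from collections import Counter


def tokenize(title: str):
    """Lowercase a title and split it into words, keeping apostrophes (assumed scheme)."""
    return re.findall(r"[a-z']+", title.lower())


def word_counts(titles) -> Counter:
    """Count every word across all titles, with no stop list applied."""
    counts = Counter()
    for title in titles:
        counts.update(tokenize(title))
    return counts


# A few titles from this post, used as toy input rather than the full dataset.
titles = [
    "Love Is A Many Splendored Thing",
    "Love Letters in the Sand",
    "That's the Way Love Goes",
    "The Way You Love Me",
    "I Will Always Love You",
    "Don't Fall in Love with a Dreamer",
]
counts = word_counts(titles)
print(counts.most_common(5))
print("the/a ratio:", counts["the"] / counts["a"])
```

With the real counts quoted above (438 for the, 226 for a), the same ratio comes out at about 1.9.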

Here’s the full year-by-year chart, but there’s not really a lot of data in the early years, so probably don’t look too seriously at it til the 1930s or maybe even the 1950s.

[Graph: song titles with love-words, year by year]

In terms of identifying song titles (and this is worrying for us), the word love is only half as useful to us now as it would have been a few decades ago. It’s a good example of how language processing systems need to adapt and change over time (something for another post).

Now for some other text analysis odds and ends. First, if you were curious, love songs tend to have longer titles—it’s not huge, though: 22.0 characters vs. 18.0 characters on average (that is significant since there’s so much data).

Also, I feel like I’d be remiss if I didn’t give a little more credit to the folks who do a lot of the writing. The major song writers for 2011-2012 have been D. Carter, Max Martin, L. Gottwald, and A. Graham; these four writers are in the writing credits for 110 of the 925 top songs from the last two years. [Update: see comments regarding disambiguation, you may know D. Carter as Lil Wayne, L. Gottwald as Dr. Luke, A. Graham as Drake]

Finally, in the interest of full disclosure, putting love in your title doesn’t seem to get your song to last longer. Actually, the average number of weeks for a non-love song is 11.4 weeks, for a love song it’s 9.4 (again, since there’s so much data, this is statistically significant). You’re gonna have to love for love, not for platinum.

– Tyler Schnoebelen (@TSchnoebelen)

Code-switching, Morning Edition, and a new site!

16 Apr

Hi everyone,

Check out info about my appearance on Morning Edition this morning by going here:

http://idibon.com/code-switching-as-featured-on-nprs-morning-edition/

That is ALSO probably where I’ll be spending a bit more time writing stuff that falls in (and outside of) corpus linguistics. But I’ll probably put stuff here that doesn’t really fit on a company blog. (Uh, see my post here on topic modeling, for example.)

Also think about following me on Twitter to get micro-updates and links to bigger ones:

https://twitter.com/TSchnoebelen

Code-switching on Morning Edition

15 Apr

Code Switch is the name of a new project at NPR focusing on race, ethnicity and culture. The head of the project’s blog, Gene Demby, introduced it on Morning Edition today. (Morning Edition is a good name—at 5:50am I awoke to a call from my sister in Iowa who was excited to hear my voice on her work commute). Gene asked me to say a few words about what happens when people code-switch—that is, shift between one way of speaking and another.

Listen here: http://www.npr.org/2013/04/15/177275789/new-npr-team-covers-race-ethnicity-and-culture?sc=17&f=3

Often “code switching” is about switching languages–maybe you grew up speaking Spanish but you switch back and forth between Spanish and English when you’re talking to your siblings. But even if you only speak one language, you are still shifting it around, like when you move between informal and highfalutin. Different “linguistic resources” (pronunciations, spellings, words, syntactic constructions) have different associations.

What you might not guess from my clip is that my mom is not actually from the South (though since this aired she has started signing emails, “Your Southern Cheryl Belle”). So why is it that she occasionally has a Southern accent? The answer has to do with the fact that there are all sorts of linguistic resources out there for us to draw on. In America, Southern accents are so prominent that even if you’ve never met anyone from the South, you still might draw(l) on Southern speech sometimes.

Depending on what you’re talking about and who you’re talking to, you may consciously or subconsciously shift to bring in some set of associations. For example, Southern accents have happy things like politeness associated with them and more negative things like stupidity (check out work by folks like Dennis Preston and Kathryn Campbell-Kibler). I hope it goes without saying that speaking in one language, accent, or style doesn’t make you polite or rude, smart or stupid. The associative nature of language is useful: it gives us shortcuts and expressiveness. But there is a darker side: we may dismiss people because we think we know everything we need to know about them after they open their mouths. We don’t.

Most of the world speaks more than one language, and so many people regularly switch between languages mid-sentence. These shifts do carry meaning: a change in sentiment, privacy, or simply to capture some expression that cannot be truly translated (check out Peter Auer and Carol Myers Scotton). Natural language processing systems have tended to ignore code-switching: at best, they give sentence-level identification. Our systems are sensitive to shifts wherever they happen and understand them as meaningful moments.

– Tyler Schnoebelen (@TSchnoebelen)

Some favorites

Intro to corpus linguistics

Here’s my presentation to Stanford undergrads about corpus linguistics. You’ll find it full of examples and resources. And even some findings. http://www.stanford.edu/~tylers/notes/presentations/IntroductionToCorpusLinguistics.pptx
Chat room corpus

Went hunting around for some chat room corpora today–I thought I’d find tons and tons but really just turned up one resource. But it’s a big one: over 30 billion words across 47,860 English language news groups from Oct 2005 to Jan 2011. Posts that are not in English are pulled out and the people […]
African language corpora

There are over two thousand African languages, spoken (in situ) by 15% of the world’s population. In density of linguistic diversity it is rivaled only by New Guinea (which probably exceeds it to be honest). And yet it is the Electronic Dark Continent. The LRE Map will give you 663 corpora/computational tools on English. But (almost) […]
COCA: What a fantastic source of data!

Intro 425 million words from 1990-2011. I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has put together a bunch of corpora and put together an easy-to-use interface so you can make sophisticated queries on vast amounts […]
What were the cultural keywords when you were born?

Raymond Williams published a fascinating (and often-cited) book called Keywords (first in the 70s, then an update in the 80s). It’s full of really interesting stuff (my notes are here). But Williams’ words were just sort of the ones he saw flying around and took an interest in. This post gives you something a little more […]
