Archive | January, 2014

9 top computational linguistics citations

28 Jan


Over the last two years, there have been 3,419 items cited at the end of various articles and squibs in the Journal of Computational Linguistics. At the bottom of the page, you’ll find references to all the ones that were cited more than 5 times. Unlike the large and varied body of work in the top conferences, the journal articles tend to show more about the longer-term trends in natural language processing.

When we look at these broader trends, one of the major takeaways is the importance of good data. While most research in natural language processing is about algorithms, the most cited papers are about the data sets used to evaluate those algorithms. The data sets allow benchmarking and serve as a kind of lingua franca for researchers in different subfields of natural language processing. (Here’s my parenthetical where I tell those of you in industry to go check out our products page: one of the things we’re experts on is producing good data sets in English and lots of other languages.)

Instead of summarizing the original papers or just copying the abstracts, we thought we’d have some fun and boil down the landmark advances of our science to social-messaging versions, in honor of how these technologies are now driving much of social media analytics:

Marcus et al ’93: The best syntax is Wall Street Journal syntax. And sometimes strangers talking on phones.

Stanford Parser morphology stuff is awesomesauceamazeballs. Totes refer to Klein & Manning ’03.

Some WordNet synsets: ‘pride, superbia’, ‘wrath, ira’, ‘lust, luxuria’. Still best for semantic relations. So cite Fellbaum ’98.

Parsing morphologically rich languages is hard. Turkish, Arabic and Slovene, we’re looking at you. Buchholz & Marsi ’06.

The Berkeley Parser maxes the expected rule count. Also state splitting (cf. “North Colorado”). See Petrov & Klein ’07.

Brown reranking parser is epic. And yay, PCFG (Probabilistic Context-Free Grammars not Pacific Gold Corp). Charniak & Johnson ’05.

Gotta evaluate your Arabic/Chinese/French/Spanish machine translation? Cite Papineni et al ’02.

State-of-the-art NLP relies on hand-tagged corpora like PropBank. See Palmer et al ’05. ps-ALL YOUR VERB BELONG TO US

Why not build a thesaurus by using syntactic co-occurrence statistics? Cite Lin ’98. (<<<3 | luvluvluv)

If you’re studying uncertainty you might likely suppose or suspect that you could cite Farkas et al ’10.

Four fun charts

Here are my favorite tables from these resources.

From Papineni et al. (2002): when you’re translating individual words (unigrams), machine translation isn’t *horrible*, but once you’re doing 4-grams (“uh like this one”), machine translation is a lot worse (though notice that the human translations aren’t great by this measure, either).

[Table: BLEU scores by n-gram order, from Papineni et al. (2002)]
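To make the n-gram point concrete, here’s a minimal sketch of the modified (clipped) n-gram precision at the core of BLEU; full BLEU also adds a brevity penalty and a geometric mean over n-gram orders, which this sketch omits:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Clipped n-gram precision: each candidate n-gram is credited at most
    # as many times as it appears in any single reference translation.
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# The classic degenerate candidate from the BLEU paper:
candidate = "the the the the the the the".split()
references = [["the", "cat", "is", "on", "the", "mat"]]
print(modified_precision(candidate, references, 1))  # 2/7: "the" is clipped at 2
```

Clipping is what punishes the degenerate candidate above: without it, repeating a common word would score a perfect unigram precision.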

From Lin (1998), nouns that are used in similar syntactic slots are usually synonyms, but the same method applied to adjectives gets a lot of antonym pairs since we have a lot of rhetorical adjective flourishes. Also notice that we could really fix some of this up by mash-up: stormulent, paramiliformed, improviorating, somftentimes.

[Table: similar-word clusters, from Lin (1998)]
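For a flavor of how a table like this gets built, here’s a toy sketch of distributional similarity over dependency contexts. It uses plain cosine similarity rather than Lin’s information-theoretic measure, and the words, context labels, and counts are all invented for illustration:

```python
import math
from collections import Counter

# Toy word -> dependency-context counts standing in for triples extracted
# from a parsed corpus; all of these numbers are made up.
contexts = {
    "storm":   Counter({"subj-of:hit": 4, "mod:severe": 3, "obj-of:predict": 2}),
    "tempest": Counter({"subj-of:hit": 3, "mod:severe": 2, "obj-of:describe": 1}),
    "calm":    Counter({"mod:eerie": 2, "obj-of:restore": 3}),
}

def cosine(a, b):
    # Words that fill the same syntactic slots share contexts, so their
    # context vectors point in similar directions.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine(contexts["storm"], contexts["tempest"]))  # high: near-synonyms
print(cosine(contexts["storm"], contexts["calm"]))     # 0.0: no shared contexts
```

Note that antonyms often share syntactic slots too (“a severe storm”, “an eerie calm” both modify weather nouns), which is exactly why the adjective clusters mix synonyms and antonyms.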

Alright, I know it’s ridiculous to throw in this table from Buchholz & Marsi (2006), a hateful thing for me to do, really. But I wanted to show how Arabic (Ar), Slovene (Sl), and Turkish (Tu) really were consistently hard for folks to parse correctly. The rows are labeled by the research group attempting the parsing; as you can see, the McDonald team and the Nivre team consistently did the best.

[Table: dependency parsing scores by language and team, from Buchholz & Marsi (2006)]

And finally, from Palmer et al. (2005), a reminder that syntactic ambiguity is everywhere, even where we don’t expect it. The “ditransitive reading” is the one that you will think of first. The “predicative reading” would be something like a scene in which Mary, furious at John, hurled at him the following invective: “You…you…you…DOCTOR!” And then someone reported what she had said.

[Table: ditransitive vs. predicative readings, from Palmer et al. (2005)]

Read ’em yourself

Buchholz, Sabine, and Erwin Marsi. “CoNLL-X shared task on multilingual dependency parsing.” In Proceedings of the 10th Conference on Computational Natural Language Learning, pp. 149-164. Association for Computational Linguistics, 2006.

Charniak, Eugene, and Mark Johnson. “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking.” In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 173-180. Association for Computational Linguistics, 2005.

Farkas, Richárd, Veronika Vincze, György Móra, János Csirik, and György Szarvas. “The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text.” In Proceedings of the 14th Conference on Computational Natural Language Learning—Shared Task, pp. 1-12. Association for Computational Linguistics, 2010.

Fellbaum, Christiane, ed. WordNet: An Electronic Lexical Database. MIT Press, 1998.

Klein, Dan, and Christopher D. Manning. “Accurate unlexicalized parsing.” In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 423-430. Association for Computational Linguistics, 2003.

Lin, Dekang. “Automatic retrieval and clustering of similar words.” In Proceedings of the 17th International Conference on Computational Linguistics-Volume 2, pp. 768-774. Association for Computational Linguistics, 1998.

Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. “Building a large annotated corpus of English: The Penn Treebank.” Computational Linguistics 19, no. 2 (1993): 313-330.

Palmer, Martha, Daniel Gildea, and Paul Kingsbury. “The proposition bank: An annotated corpus of semantic roles.” Computational Linguistics 31, no. 1 (2005): 71-106.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: a method for automatic evaluation of machine translation.” In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Petrov, Slav, and Dan Klein. “Improved Inference for Unlexicalized Parsing.” In Proceedings of Human Language Technologies, Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 404-411. 2007.

– Tyler Schnoebelen (@TSchnoebelen)



Innovating because innovation

15 Jan

Imagine that you want copies of some document in 1978 (if you weren’t already). You’re in a hurry. But you’re stuck in a line for a photocopier because you don’t have any of the innovations that we rely upon today. There’s one major innovation in efficiency that we’ll focus on here. In a minute. First let’s talk about how to get ahead in life.

In 1978, Ellen Langer, Arthur Blank, and Benzion Chanowitz published a paper about experimenters going up to lines of people waiting to make copies. In the first condition, the experimenter walks up and says, “Excuse me, I have (5 or 20) pages. May I use the Xerox machine?” In the second condition, they say “Excuse me, I have (5 or 20) pages. May I use the Xerox machine, because I’m in a rush?” In other words, they give a reason for needing to skip the line. In the third condition, they say, “Excuse me, I have (5 or 20) pages. May I use the Xerox machine, because I have to make copies?” Which is pretty empty in its content. Of course you have to make copies.

What Langer et al. were wondering is whether giving even a meaningless reason worked for skipping the queue. When the experimenter was only doing 5 pages, the reason and the “empty reason” conditions were basically identical, and both improved the chances of being allowed to go first. But when a lot of copies were involved (“20 pages” was the effortful condition), suddenly people needed more information. The hollow answer wasn’t enough.

Now, consider one of the most important innovations in efficiency to come about since then. This one is fairly recent, probably from the last 5 years: the ability to give a reason without so many words. Imagine how much time people in photocopier lines wasted saying, “Because I have to make copies”. Obs, it’s quicker to say “because copies”.

Because X was chosen as the American Dialect Society’s Word of the Year for 2013. Why? The basic idea is that normally you have to say because of X or the “X” has to be a full clause (something that would sound fine on its own without a because in front of it). This kind of change is very unusual and you can find more coverage about because X by Gretchen McCulloch, Stan Carey, Neal Whitman, and Geoffrey K. Pullum, among others. In this blog post, I’ll report how the construction is used and also how we might think of innovation inside and outside of language.

Behind this article is a simple notion: language won’t stand still. Most natural language processing systems are built using models of language that are fixed in time, which means they can go stale very quickly. That is why there is not (and never will be) a static piece of software that you can download that solves some problem like sentiment analysis or entity identification: people keep finding new ways to express sentiment about new entities in the world. Sometimes it is simply the topics that change, but sometimes it is the language itself.

On innovations

No one seems to have a real corpus of the because X innovation, so I put one together using recent data from Twitter. One of the things you want to understand in studying an innovation is the context: who is using the innovation and in which ways?

Marketing folks and product managers need to follow the diffusion of innovations. “Innovation” has been a particularly big business buzzword since about the mid-1990s, but philologists, dialectologists and sociolinguists have actually been studying it for…well, you could say hundreds of years, but let’s put the real fire in the belly around the 1870s. The early methods are pretty rough, but the same questions that folks in industry ask about how ideas, products, and services travel through society are the ones sociolinguists ask about language.

It is easier to adopt a new word, pronunciation, or grammatical construction than it is to buy a new mobile phone, but the fundamentals are quite similar: a style—of speaking, just as of dressing or being—requires access and a kind of entitlement to adopt the style. If you don’t know about an innovation, you can’t really adopt it, but you also may not adopt it because you aren’t licensed to. That’s what makes something like Bronies—guys who love My Little Pony—stand out. They don’t seem licensed to adopt that product. What are the social mountain ranges that block the spread of innovations, and what are the Autobahns that promote them?


One relevant finding: “The conduits of innovation are the multiple weak ties of everyday urban interaction in the neutral areas outside close-knit community territories” (see Milroy and Milroy 1985 for more about this). That basically means that innovations enter a community because of interactions with folks outside the community.

I’ve only been tracking because X for the first couple weeks of January, so I don’t have a lot to say about how it’s spreading through the network, although you’ll get a sense about it from the top words and from the demographics/interests below. But let me suggest that the construction skews towards “interpersonal”. That is, 36% of because X tweets involve an interaction with one or more other users (@-messaging them).

There’s a kind of sociability that we find here that is greater than what we find for, say, people who are tweeting about their mobile phone networks. Among the major telecommunications companies in the US, AT&T has the lowest percentage of tweets-with-@’s (22%) and T-Mobile has the most (31%). More striking is the fact that 15% of because-X tweets involve multiple @-recipients while the telecom companies get between 3% (AT&T) and 5% (T-Mobile) with multiple folks in them.
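The @-mention percentages above come down to simple counting; here’s a sketch of the idea, with made-up tweets standing in for the real data:

```python
import re

# Made-up tweets standing in for the collected data.
tweets = [
    "@alice do you ever just nap all day because same",
    "@bob @carol movie night tonight because yolo",
    "wearing fuzzy socks to work because yolo",
]

mention = re.compile(r"@\w+")

with_at = [t for t in tweets if mention.search(t)]
multi_at = [t for t in tweets if len(mention.findall(t)) > 1]

print(len(with_at) / len(tweets))   # share of tweets that @-message anyone
print(len(multi_at) / len(tweets))  # share reaching multiple recipients
```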

Obviously a message (or a grammatical construction) that habitually reaches more people has a greater chance of spreading. But this does depend upon the recipient’s ideology about the message and the speaker/author. Imagine some group you are actively not a part of and that you have some distaste for, say, neoconservative hipsters. If 80% of all the times you see because X or Sprint’s ultra-coolitude it’s coming from a neocon hipster, you are unlikely to adopt the innovation except as a way to mock them. (Although be careful, if you use it ironically long enough, you’ll accidentally start using it for reals. Uff. Like that.)

Who uses “because X”?

Language reflects and creates who we are. So what do tweeps who use because X want to talk about—outside of these particular because-X tweets?

Compared to other folks tweeting at the same time, because-X‘ers are fans of Sherlock, YouTube, Tumblr, One Direction (especially Harry), Justin Bieber, and Ariana Grande. They also like “bands” more generally, pizza, sex, cats, and books. They are decidedly less likely to talk about software, basketball, NASCAR, business, or to use words associated with African-American Vernacular English.


In terms of broad demographics, it does look like young women are leading the innovation. Folks who use because X are mostly in the US (but there are a fair number of Londoners using it). The construction is fairly evenly spread across the US, though there’s a bit more on the East Coast and in Arizona than we’d expect given the base levels of Twitter use in those places.

The patterns

We model innovation in a number of ways, but we weren’t explicitly tracking because X. So I grabbed 28,294 recent tweets. The final counts I’m going to use are based on 23,583 uses: basically, most of the junk in the data comes from people who are writing very long tweets (so the because X is really just there because the tweet was truncated). For that reason, I only used tweets that were 125 characters or less (Twitter’s maximum is 140). I also wanted to restrict to “real” Twitter users, so I only used tweeps whose follower and propensity-to-be-retweeted counts were within two standard deviations of the mean. In other words, real newbies/non-users and super-users are filtered out. I further uniqued the messages. I did allow RT’s to stay in if they had additional content in them, though in this particular case the results when you remove RT’s are almost the same. Fwiw, this is the kind of data cleaning that goes on behind the scenes in any sophisticated text analysis.
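As a sketch, the filtering steps above might look like this; the field names (text, followers, retweet_rate) are hypothetical, and the RT-handling step is omitted:

```python
import statistics

def clean(tweets):
    # 1. Drop near-limit tweets, where a final "because X" may just be
    #    an artifact of truncation.
    tweets = [t for t in tweets if len(t["text"]) <= 125]

    # 2. Keep users within two standard deviations of the mean on follower
    #    count and on propensity to be retweeted.
    for field in ("followers", "retweet_rate"):
        vals = [t[field] for t in tweets]
        mu, sd = statistics.mean(vals), statistics.pstdev(vals)
        tweets = [t for t in tweets if abs(t[field] - mu) <= 2 * sd]

    # 3. Deduplicate identical messages, keeping the first copy.
    seen, unique = set(), []
    for t in tweets:
        if t["text"] not in seen:
            seen.add(t["text"])
            unique.append(t)
    return unique
```

One design note: step 2 recomputes the mean and standard deviation on the already-filtered list for each field in turn, so the order of the filters matters slightly.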


Notice that the top term (by far) is yolo. If you spell it out, because you only live once is actually completely standard (you only live once is an example of a fine full clause). But yolo is a lot like an interjection. Other “compressed clauses” are ily (‘I love you’), idgaf (‘I don’t give a fuck’), idk (‘I don’t know’), idc (‘I don’t care’), ilysm (‘I love you so much’). If we group all of the items that have 50 occurrences or more, then compressed clauses come in second place (21.78% of the total) after nouns (32.02%).

Part of speech (examples)      Share of words with ≥ 50 occurrences
Noun (people, spoilers)        32.02%
Compressed clause (ilysm)      21.78%
Adjective (ugly, tired)        16.04%
Interjection (sweg, omg)       14.71%
Agreement (yeah, no)           12.97%
Pronoun (you, me)               2.45%

What are people yolo’ing about? Movies and TV shows, bed, clothes, games, and school. As an aside, most of the because yolo’s are self-mocking about the quotidian things people do. If it isn’t self-mocking, we need to have some serious conversations about the fact that wearing fuzzy socks to work is not really something you should do because yolo.

Because same might require some explanation. It usually happens as an answer to a question, indicating something like ‘I sure do myself’. It’s particularly frequent in the rhetorical do you ever construction:

do you ever just lie on the floor wearing neon clothing while wiggling around and saying your a glow worm because same

Remember that night like 2 days ago when I spent literally 2 hours crying over Harry because same

Finally, you might be interested in swag and its versions (swaggy, sweg, yoloswag, swagtastic). Because swag is about conveying a personal style, but it has a similar expressive force as because awesome. It tends to happen as a request or excitement about Twitter followers, to explain why the tweep is watching a particular movie/TV show, or in reference to what the tweep is wearing.

@someone hey connor you should follow me because swag

wearing my favorite aeropostale hoodie because swaggg [here’s a post on expressive lengthening—TJS]

Watching the suite life if zack and Cody because swag

I just got home because swaggity swag because sagacity [I’m pretty sure this was meant to be swagacity because spellcheck—TJS]


Some extra notes for linguists

Verbs are very rare, though maybe because stop is something you want to incorporate into your own speech (43 instances). You’d think that something like the highly frequent want would appear a lot more: it only appears 7 times. Most -ing words are really being used like adjectives; the top exceptions have low counts: crying has 16, dying has 11. Because sleep is the most frequent verby thing (48 instances), but its use seems very nouny: “Set an alarm for 8 so I could get up and be productive early. Reset an alarm for 930 because sleep.”

And because I had some folks ask: for -ly adverbs, the top ones go to seriously (46) and obviously (44).

@someone do they want me to hate Thomas or like him because seriously?!? I hate him. But then he does something good.

Moms who tell non-moms pregnancy/child birth horror stories are the G damn WORST. With capital letters. Because seriously.

Alexander resisted bedtime for an hour then finally fell asleep snuggling the bulb syringe/snot sucker. Because obviously.

I was specifically looking for patterns in which because was followed by a single word, which was then followed by the end of the tweet, the end of a line, punctuation, an emoticon, or “RT”. Some of these ended up having multiple because X phrases in them that involve multiple words.

  • gonna upload some pics from my photoshoot in the summer, because well… because the internet..
  • @imanonymizingthisusername because life. Because the universe. Because everythingggggg
  • So like his name is Hope because final fantasy 13 because fvaorite character because Hope
  • Do it before camp AJK, because not.enough time already , All Competition Is Coming.

These numbers don’t include the very frequent because why, because what, and because because (because). Feel free to disagree with me about pulling them out. The first two are especially frequent and like the compressed clauses and words like yes and no that can also stand alone, they may well pave the road for nouns and adjectives.
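The extraction pattern described above can be approximated with a regular expression; this is a rough reconstruction, not the exact pattern used in the study, and the emoticon alternatives are small stand-ins:

```python
import re

# "because" + one word, followed by end of text, a newline, punctuation,
# a simple emoticon, or "RT". The emoticon class here is a stand-in.
BECAUSE_X = re.compile(
    r"\bbecause\s+(\w+)(?=\s*(?:$|\n|[.!?,;:]|[:;]-?[)(DP]|\bRT\b))",
    re.IGNORECASE,
)

examples = [
    "wearing fuzzy socks to work because yolo",
    "do you ever cry over Harry because same",
    "set an alarm for 8. reset it because sleep.",
    "because of the rain we stayed in",  # ordinary "because of": no match
]

for text in examples:
    m = BECAUSE_X.search(text)
    print(m.group(1) if m else None)  # yolo, same, sleep, None
```

The lookahead is what excludes ordinary because of X and full clauses: “because of the rain…” fails because “of” is followed by more words rather than an end-of-tweet boundary.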

For folks who aren’t in sociolinguistics, here are some reading suggestions about language and innovation. In addition to Milroy and Milroy (1985) on social networks, these are other works that tend to get cited in the major sociolinguistics journals in recent years.

– Tyler Schnoebelen (@TSchnoebelen)