Over the last two years, there have been 3,419 items cited at the end of various articles and squibs in the journal Computational Linguistics. At the bottom of the page, you’ll find references to all the ones that were cited more than five times. Unlike the large and varied body of work in the top conferences, the journal articles tend to reflect the longer-term trends in natural language processing.
When we look at these broader trends, one of the major takeaways is the importance of good data. While most research in natural language processing is about algorithms, the most cited papers are about the data sets used to evaluate those algorithms. The data sets enable benchmarking and serve as a kind of lingua franca for researchers in different subfields of natural language processing. (Here’s my parenthetical where I tell those of you in industry to go check out our products page: one of the things we’re experts on is producing good data sets in English and lots of other languages.)
Instead of summarizing the original papers or just copying the abstracts, we thought we’d have some fun and boil down the landmark advances of our science to social-messaging versions, in honor of how these technologies are now driving much of social media analytics:
Marcus et al ’93: The best syntax is Wall Street Journal syntax. And sometimes strangers talking on phones.
The Stanford Parser’s unlexicalized parsing is awesomesauceamazeballs. Totes refer to Klein & Manning ’03.
Some WordNet sinsets: ‘pride, superbia’, ‘wrath, ira’, ‘lust, luxuria’. Still best for semantic relations. So cite Fellbaum ’98.
Parsing morphologically rich languages is hard. Turkish, Arabic and Slovene, we’re looking at you. Buchholz & Marsi ’06.
The Berkeley Parser maxes the expected rule count. Also state splitting (cf. “North Colorado”). See Petrov & Klein ’07.
Brown reranking parser is epic. And yay, PCFG (Probabilistic Context-Free Grammars not Pacific Gold Corp). Charniak & Johnson ’05.
Gotta evaluate your Arabic/Chinese/French/Spanish machine translation? Cite Papineni et al ’02.
State-of-the-art NLP relies on hand-tagged corpora like PropBank. See Palmer et al ’05. P.S. ALL YOUR VERB ARE BELONG TO US
Why not build a thesaurus by using syntactic co-occurrence statistics? Cite Lin ’98. (<<<3 | luvluvluv)
If you’re studying uncertainty you might likely suppose or suspect that you could cite Farkas et al ’10.
Four fun tables
Here are my favorite tables from these resources.
From Papineni et al (2002): when you’re translating individual words (unigrams), machine translation isn’t *horrible*, but once you’re doing 4-grams (“uh like this one”), machine translation is a lot worse (though notice that the human translations aren’t great by this measure, either).
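The core ingredient of BLEU can be sketched in a few lines: modified n-gram precision, where each candidate n-gram only gets credit up to the number of times it appears in the reference. Here’s a minimal sketch (the example sentences are made up for illustration; real BLEU also combines several n-gram orders geometrically and adds a brevity penalty):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """BLEU-style modified n-gram precision: each candidate n-gram
    is clipped to the number of times it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat the cat on the mat".split()

p1 = modified_precision(candidate, reference, 1)  # 5/7: unigrams look okay-ish
p4 = modified_precision(candidate, reference, 4)  # 0.0: no 4-gram matches at all
```

Notice how the same candidate that looks passable at the unigram level scores zero on 4-grams, which is exactly the pattern in the Papineni et al table.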
From Lin (1998): nouns that are used in similar syntactic slots are usually synonyms, but the same method applied to adjectives gets a lot of antonym pairs, since we use a lot of rhetorical adjective flourishes. Also notice that we could really fix some of this up by mash-up: stormulent, paramiliformed, improviorating, somftentimes.
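The idea behind Lin’s thesaurus-building can be sketched with a toy example: represent each word by the set of dependency contexts it appears in, then score pairs of words by how much those sets overlap. Lin’s actual measure weights shared features by their information content; this sketch substitutes a plain Dice overlap, and the dependency triples are invented for illustration:

```python
from collections import defaultdict

# Toy dependency triples: (word, relation, other word).
# These are made-up examples, not real corpus counts.
triples = [
    ("storm", "subj-of", "hit"), ("storm", "mod", "violent"),
    ("hurricane", "subj-of", "hit"), ("hurricane", "mod", "violent"),
    ("hurricane", "obj-of", "predict"),
    ("idea", "subj-of", "emerge"), ("idea", "mod", "good"),
]

# Each word's feature set is the set of (relation, other-word) contexts.
features = defaultdict(set)
for word, rel, other in triples:
    features[word].add((rel, other))

def dice_similarity(w1, w2):
    """Dice overlap of two words' dependency contexts
    (a simplification of Lin's information-weighted measure)."""
    f1, f2 = features[w1], features[w2]
    if not f1 or not f2:
        return 0.0
    return 2 * len(f1 & f2) / (len(f1) + len(f2))
```

Words that fill the same syntactic slots (`storm`/`hurricane`) come out similar; unrelated words (`storm`/`idea`) score zero. The antonym problem in the adjective table falls out naturally: “hot coffee” and “cold coffee” put hot and cold in identical slots.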
Alright, I know it’s ridiculous to throw in this table from Buchholz & Marsi (2006), a hateful thing for me to do, really. But I wanted to show how Arabic (Ar), Slovene (Sl), and Turkish (Tu) really were consistently hard for folks to parse correctly. The rows are labeled by the research group attempting the parsing; as you can see, the McDonald team and the Nivre team consistently did the best.
And finally, from Palmer et al. (2005), a reminder that syntactic ambiguity is everywhere, even where we don’t expect it. The “ditransitive reading” is the one that you will think of first. The “predicative reading” would be something like a scene in which Mary, furious at John, hurled at him the following invective: “You…you…you…DOCTOR!” And then someone reported what she had said.
Read ’em yourself
Buchholz, Sabine, and Erwin Marsi. “CoNLL-X shared task on multilingual dependency parsing.” In Proceedings of the 10th Conference on Computational Natural Language Learning, pp. 149-164. Association for Computational Linguistics, 2006.
Charniak, Eugene, and Mark Johnson. “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking.” In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 173-180. Association for Computational Linguistics, 2005.
Farkas, Richárd, Veronika Vincze, György Móra, János Csirik, and György Szarvas. “The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text.” In Proceedings of the 14th Conference on Computational Natural Language Learning—Shared Task, pp. 1-12. Association for Computational Linguistics, 2010.
Klein, Dan, and Christopher D. Manning. “Accurate unlexicalized parsing.” In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 423-430. Association for Computational Linguistics, 2003.
Lin, Dekang. “Automatic retrieval and clustering of similar words.” In Proceedings of the 17th International Conference on Computational Linguistics-Volume 2, pp. 768-774. Association for Computational Linguistics, 1998.
Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. “Building a large annotated corpus of English: The Penn Treebank.” Computational Linguistics 19, no. 2 (1993): 313-330.
Palmer, Martha, Daniel Gildea, and Paul Kingsbury. “The proposition bank: An annotated corpus of semantic roles.” Computational Linguistics 31, no. 1 (2005): 71-106.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: a method for automatic evaluation of machine translation.” In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.
Petrov, Slav, and Dan Klein. “Improved Inference for Unlexicalized Parsing.” In Proceedings of Human Language Technologies, Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 404-411. 2007.
– Tyler Schnoebelen (@TSchnoebelen)