
COCA: What a fantastic source of data!

3 Feb


425 million words from 1990-2011.

I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has assembled a bunch of corpora and built an easy-to-use interface so you can make sophisticated queries on vast amounts of data. It’s a lot bigger than most of the corpora you may be using now (CELEX, Switchboard, etc.). And while its sentences aren’t annotated with tree structures, it does have part-of-speech info (and makes it really easy to get collocates).

This post is really about getting started with COCA, but I’ll try to do it in the framework of a particular linguistic phenomenon. But if you get nothing out of this post but USE COCA, that’ll be enough. It also makes it easy to compare to historical American English, the BNC, and Google Books/N-gram (though I won’t be showing that here).


A few months ago Anna Chernilovskaya came by and presented work she and Rick Nouwen had been doing on exclamatives like:

(1) What a beautiful song John wrote!

Their work is set against Rett (2011), which treats a wh-exclamative as the speaker expressing that something is noteworthy in the given context–in other words, to say (1) is to say there’s something noteworthy about John’s song relative to the standard beauty of songs.

If you don’t have a degree adjective like “beautiful” (instead something like What a song John wrote!), Rett says you have an operator that acts like a silent adjective, so you’re exclaiming about beauty, weirdness, or complexity. Chernilovskaya and Nouwen are saying, “Nah, it’s simpler–it’s just direct noteworthiness. Drop all this degree stuff.”

My question: how are wh-exclamatives actually used by English speakers? My intuition is with C&N that it’s just about noteworthiness. But is that all?

First steps with COCA

Go to http://corpus.byu.edu/coca/, look over to the upper right–you can log in or register, as appropriate.

Most of the action is going to be in the panel on the left. Some of the stuff is hidden to reduce complexity, so if you want to see part-of-speech tags, just click on the text that says “POS LIST” and you’ll get a drop-down menu you can choose stuff from. If you want to do collocation stuff, just click “COLLOCATES”, etc.

COCA's left-pane

Let’s go ahead and start collecting examples of wh-exclamatives. In the “WORD(S)” text box, we could type:

what  [at*]

And that would get us 82,074 sentences, including both what a and what an. It would also get us what the, what every, and what no. Including sentences like Her would-be opponents are pondering these questions and what the answers mean for their own possible candidacies.

That’s rather too much. Let’s try the following–the [y*] means “any punctuation”.

[y*] what [at*]

Now we have 18,063 sentences. This may help us see interesting cases like I mean, what the heck, but perhaps it’s still taking us too far afield. Let’s just do what you are probably thinking we should’ve done from the beginning:

[y*] what a|an

8,919 results. Looking through the results, this is a pretty good query. We could restrict which punctuation we care about, but if you go ahead and do this query yourself, I think you’ll see why we want to keep most punctuation.
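To get a feel for what that last query matches, here’s a rough stand-in in Python regex. COCA runs these queries server-side over its own tagged corpus, so this pattern is only an approximation for plain text (the example sentences are adapted from the ones above):

```python
import re

# Approximation of the COCA query `[y*] what a|an`:
# some punctuation, then "what", then "a" or "an" (case-insensitive).
PATTERN = re.compile(r"[.,;:!?'\"]\s*what\s+an?\b", re.IGNORECASE)

sentences = [
    "I mean, what the heck is that?",         # caught by `[y*] what [at*]`, but not here
    "Wow. What a beautiful song John wrote!", # punctuation + "what a" -> match
    "And what a story we have coming up.",    # no preceding punctuation -> no match
]

hits = [s for s in sentences if PATTERN.search(s)]
print(hits)
```

The point of requiring punctuation before what is the same as in the COCA query: it filters out embedded uses (what the answers mean…) in favor of exclamative-looking ones.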

Search results in COCA (click to make 'em bigger)

The top right box shows all the matches lumped together by punctuation/article; click on any of them and the actual sentences will show up below. Notice the drop-down Help box on the far right above the sentences. Use that to find out more about query syntax.


Let’s see what the most common words are that go with a what a|an construction. We have a few options; we can run queries like:

what a|an [j*] [n*]
what a|an [j*]
what a|an [n*]

This third one is pretty interesting because it helps us see which sorts of exclamations are made without an adjective. (One person in our semantics/pragmatics group confided that his father always said What a baby! when confronted by an unattractive child–there’s a pressure to exclaim something but he doesn’t want to lie, so this strategy manages to give the right form without quite committing to the meaning that the parents might take away.)

Here is a selection from the top 15 (COCA’s rank numbers on the left), though the first one ends up not really being part of our pattern a lot of the time.

1  WHAT A LOT 507
4  WHAT A MAN 183
6  WHAT A MESS 159
9  WHAT A DAY 150
11  WHAT A WOMAN 126
13  WHAT A WAY 104
14  WHAT A STORY 101
15  WHAT A PITY 95

And in case you were curious about the one-word exclamatives with actual exclamation marks:

what a|an [n*] !
1  WHAT A MESS ! 30
4  WHAT A SIGHT ! 23
5  WHAT A DAY ! 22

One of the things C&N say Rett can’t handle is something like What an extremely nice man since the extremely and nice should interfere on Rett’s account. You can’t say *John is more extremely nice than Bill or *John is too extremely nice. How does this pattern work in the data?

what a|an [r*]

Now another way of looking at stuff is to look for collocational strength.

Getting collocation info

The important stuff here is that I clicked on “COLLOCATES” and put in the part of speech (adverb=[r*]) and chose the window I was looking at–in this case, two words to the right. I also adjusted the “MINIMUM” to be based on mutual information and I set it to ignore things with a mutual information of less than 2 (a standard strength threshold is 3.0, but I wanted to get a few more than that).
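Mutual information compares how often two words actually co-occur with how often chance alone would predict. A minimal sketch of one common formulation (check COCA’s help pages for the exact formula it uses–the corpus size and span numbers below are made up for illustration):

```python
import math

def mutual_information(joint, freq_node, freq_collocate, corpus_size, span=2):
    """MI in bits: log2 of observed co-occurrences over the number
    chance would predict, given each word's overall frequency and
    the size of the collocation window."""
    expected = freq_node * freq_collocate * span / corpus_size
    return math.log2(joint / expected)

# Chance-level co-occurrence gives MI = 0:
print(mutual_information(1, 1000, 500, 1_000_000))  # 0.0
# Eight times more co-occurrences than chance gives MI = 3 bits:
print(mutual_information(8, 1000, 500, 1_000_000))  # 3.0
```

This is why MI rewards rare-but-loyal collocates: a word like wonderfully may have a low absolute count next to what a, but almost all of its uses are there, so the observed/expected ratio is huge.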

A few other things:

  • You may want to restrict yourself to just SPOKEN stuff (that’s in the middle of the left-pane).
  • If you have a big query you probably want to change # HITS FREQ to something big (the default is 100).
  • Often it’s more useful to GROUP BY lemmas than words (it doesn’t matter here, but think about what would happen if I were querying verbs)
  • If you choose SAVE LISTS, you’ll get prompted to enter a list name ABOVE the top results. It’s really easy to miss.

But back to the results. The adverbs with the highest mutual information are truly, incredibly, wonderfully, extraordinarily, and remarkably, though the absolute counts are pretty low. Still, clicking around on examples may help.

Now if we do adjectives, we get these results:

Rank  Adjective  Count together  Adjective total  %  MI
1  [GREAT] 428    248,858 0.17 5.4
2  [WONDERFUL] 263      29,277 0.9 7.78
3  [BEAUTIFUL] 161      43,750 0.37 6.5
4  [GOOD] 133    409,451 0.03 2.99
5  [LOVELY] 102      10,246 1 7.93
6  [NICE] 100      50,448 0.2 5.6
7  [TERRIBLE] 84      20,290 0.41 6.67
8  [STRANGE] 78      26,432 0.3 6.18
9  [AMAZING] 60      17,204 0.35 6.42
10  [STUPID] 47      13,524 0.35 6.41

Notice how exclamatives skew positive. (That’s why the What a baby! trick works!)

And nouns, though let’s increase the window to 4 to the right.

Rank  Noun  Count together  Noun total  %  MI
1  [THING] 264    438,956 0.06 2.88
2  [DAY] 242    486,452 0.05 2.61
3  [IDEA] 209    133,349 0.16 4.27
4  [WAY] 198    521,448 0.04 2.22
5  [DIFFERENCE] 195      89,269 0.22 4.74
6  [WASTE] 166      31,419 0.53 6.02
7  [SURPRISE] 165      35,267 0.47 5.84
8  [MAN] 153    460,880 0.03 2.03
9  [SHAME] 151        9,431 1.6 7.62
10  [STORY] 148    178,875 0.08 3.34


Here is C&N’s definition of noteworthiness:

an entity is noteworthy iff its intrinsic characteristics (i.e. those characteristics that are independent of the factual situation) stand out considerably with respect to a comparison class of entities (C&N 2012: 5).

In the last section, we saw that what a (adj) story was a prominent use (#10). If we restrict ourselves to just the spoken portion of the corpus, it leaps up to #1. That’s because the spoken portion comes from talk shows and news programs (like Good Morning America, Dateline, and Larry King). If you look at the transcripts–and if you have ever listened to American news you’ll know this already–what a (great/emotional/amazing/astonishing/inspiring) story usually comes up after the story is done and a segue is happening. The same goes for what a pleasure and many of the other items. What a is used in these talk show/news programs as a way of simultaneously evaluating a topic and moving between topics (usually out of, but sometimes also into).

All of this makes me skeptical about the definition C&N provide.

Consider this Good Morning America clip from last August (fast forward to about 56:20), where two stories, back to back, are described in terms of “what a” noteworthiness:

{Story about a woman surviving in the wilderness for 3 days}

Thank you, David.

What a story.

And what a story we have coming up for you.

{Uh, that’s about the making of a boy band.}

I would contend that these stories are not really that noteworthy (they also occur at the tail end of the show and so may be the most cuttable if things earlier had gone long). You may or may not agree with me. But at a minimum we probably need to say that such exclamatives are claims about noteworthiness, not factual observations about things that are intrinsically noteworthy. Any sort of judgment about noteworthiness has to have a judge, so that seems to be a problem for arguments about intrinsic qualities.

Part of Rett’s discussion does have the speaker in the mix (pages 4-5), but then towards the end of the paper she says of examples like:

(2) How very unexpected John’s news is!

(3) What a surprise John’s height is!

“To the extent that they sound natural, [they] are interpreted as reflecting an objective surprise or unexpectedness rather than one oriented to the speaker” (Rett 2011: 19). Her main point is that gradable properties get their values from context–it’s not that they reflect the speaker’s attitude.

I’m a big fan of information-theoretic accounts of language, which give measures of surprise. The surprise of “x” and “y” co-occurring is based on the prior probabilities of their occurring separately and together. But the truth is that these probabilities are always defined against some perceiver’s experience. Psycholinguists use corpora to estimate how surprising word y is following word x, but if some subject had a remarkably different experience with “x” and “y” than most of us, well, we’d expect the effects to be different.
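A minimal sketch of that kind of estimate, with made-up counts standing in for corpus data: surprisal is just the negative log probability of the next word given the previous one, so a subject with different counts gets a different number.

```python
import math
from collections import Counter

def surprisal(bigram_counts, unigram_counts, w1, w2):
    """Surprisal of w2 given w1, in bits: -log2 P(w2 | w1),
    estimated from raw counts (no smoothing -- a toy sketch)."""
    p = bigram_counts[(w1, w2)] / unigram_counts[w1]
    return -math.log2(p)

# Toy counts standing in for one perceiver's experience (invented numbers):
unigrams = Counter({"what": 16, "a": 40})
bigrams = Counter({("what", "a"): 8, ("what", "the"): 4})

print(surprisal(bigrams, unigrams, "what", "a"))    # -log2(8/16) = 1.0 bit
print(surprisal(bigrams, unigrams, "what", "the"))  # -log2(4/16) = 2.0 bits
```

Swap in counts from a different speaker’s experience and the surprisal values change–which is exactly the point about surprise needing a perceiver.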

After looking through the actual uses of what a, I propose you HAVE to build in the speaker. And what is more, these what a sentences really are doing more than just expressing an observation of the world. More than expressing an internal state. And more than just an evaluation. They are social in their nature (what a (stupid/amazing/dumb) thing to say), so I would contend that theories should also look at consequences of the use in terms of the relationship between the speaker and their audience. My inclination is also to believe that we ought to say something about how they tend to skew positively in terms of adjectival collocates and probably how they hold that positive-skew as a default interpretation even when there’s no adjective, as in the what a baby! example.

But what a long post this is. I’ll stop.


Sentiment corpus

25 Jan

Found this and thought I’d pass it along to folks interested in sentiment/opinion/emotion research: http://www.cyberemotions.eu/data.html.

If you’re at an academic institution, they’ll give you access to a variety of texts tagged with sentiments. What you’ll get is a tag that is the average from three human raters. That’s not really a lot, though it is typical in the field right now. Each item has a 1-5 positive strength score and a separate 1-5 negative strength score.

  • BBC News forum posts: 2,594,745 comments from selected BBC News forums and > 1,000 human classified sentiment strengths.
  • Digg post comments: 1,646,153 comments on Digg posts (typically highlighting news or technology stories) and > 1,000 human classified sentiment strengths.
  • MySpace (social network site) comments: six sets of systematic samples (3 for the US and 3 for the UK) of all comments exchanged between pairs of friends (about 350 pairs for each UK sample and about 3,500 pairs for each US sample) from a total of >100,000 members and > 1,000 human classified sentiment strengths.
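So each labeled item effectively carries two averaged scores rather than one polarity. A tiny sketch of what that labeling scheme amounts to (the function and field layout here are my own illustration, not the release’s actual format):

```python
def average_ratings(annotations):
    """annotations: list of (pos, neg) tuples, one per annotator,
    each score on a 1-5 scale. Returns the mean positive and mean
    negative strength, kept as separate scales."""
    n = len(annotations)
    pos = sum(p for p, _ in annotations) / n
    neg = sum(ne for _, ne in annotations) / n
    return pos, neg

# Three annotators' (positive, negative) scores for one comment (invented):
comment = [(4, 1), (5, 1), (4, 2)]
print(average_ratings(comment))
```

Keeping the two scales separate matters: a comment can be both strongly positive and strongly negative at once, which a single polarity score would flatten.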

Top five LDC corpora

30 Oct

In this post, I’d like to start off reviewing some of the most popular corpora that the Linguistic Data Consortium provides–with a few possibilities for alternatives. If you have a favorite corpus send it in!

1. TIMIT Acoustic-Phonetic Continuous Speech Corpus

If you’re interested in speech recognition, here’s one of your main resources. It’s basically 630 people (8 American dialects) reading 10 “phonetically rich sentences”. Plus these are time-aligned with transcripts (orthographic and phonetic). It’s been hand-verified and it’s pre-split into training/test subsets.

2. Web 1T 5-gram Version 1

This is basically Google n-gram stuff for English (unigrams to 5-grams). So if you want collocates and word frequencies, this is pretty good. There are 1 trillion word tokens, after all.

  • 95 billion sentences
  • 13 million unigrams
  • 1 billion 5-grams
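As I recall, the n-gram files in this release are plain text with each n-gram tab-separated from its count (check the LDC documentation for the exact layout), so getting them into a lookup table is trivial. The sample strings and counts below are invented:

```python
# Invented sample in the tab-separated "ngram<TAB>count" layout:
sample = """what a beautiful day\t1542
what a great idea\t2210
what an honor\t871"""

counts = {}
for line in sample.splitlines():
    # rsplit guards against n-grams that might themselves contain whitespace
    ngram, count = line.rsplit("\t", 1)
    counts[ngram] = int(count)

print(counts["what a great idea"])  # 2210
```

At the real data’s scale (billions of lines) you’d stream the files rather than load a dict, but the per-line parse is the same.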

This data was released in 2006, though, so there should be more up-to-date resources.

There’s also a 2010 (Mandarin) Chinese 5-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

A 2009 Japanese 7-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T08

And a 2009 “European” 5-gram on Czech Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, Swedish: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25

3. CELEX2 (but why not try SUBTLEX?)

This corpus, circa 1996, gives you ASCII versions of three lexical databases for English, Dutch, and German. You get:

  • orthography variations
  • phonological stuff like syllables and stress
  • morphology
  • word class, argument structures
  • word frequency, lemma frequency (based on “recent and representative text corpora”)

In truth, if you just want word counts for American English, then consider using SUBTLEXus: http://subtlexus.lexique.org/. They make the case that CELEX is actually bad to rely on for frequency information (I’ll let you follow the link for their arguments against it and against Kucera and Francis). Actually, if you go ahead and check out http://elexicon.wustl.edu/, you can download words (and non-words) with reaction times and all the morphology/phonology/syntax stuff that CELEX2 gives you.


4. TIDIGITS

Okay, I had never heard of this one. The main use for this corpus is speech recognition–for digits. You get 111 men, 114 women, 50 boys, and 51 girls each pronouncing 77 different sequences of digits, recorded in 1982.

5. ECI Multilingual Text

So the European Corpus Initiative Multilingual Corpus 1 (ECI/MCI) has 46 subcorpora totaling 92 million words (marked up, but you can get the unmarked-up stuff, too).

12 of the component corpora are parallel corpora, with translations across 2-9 languages.

Most of the stuff is journalistic, and there are also some dictionaries, literature, and international organization publications/proceedings/reports. The texts seem to come mostly from the 1980s and early 1990s.

Anyone have a favorite corpus of UN delegates talking and being translated into a bunch of different languages?

Languages available: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish