COCA: What a fantastic source of data!

3 Feb

Intro

425 million words from 1990-2011.

I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has assembled a bunch of corpora and built an easy-to-use interface so you can make sophisticated queries on vast amounts of data. It’s a lot bigger than most of the corpora you may be using now (CELEX, Switchboard, etc.). And while its sentences aren’t annotated with tree structures, it does have part-of-speech info (and makes it really easy to get collocates).

This post is really about getting started with COCA, but I’ll try to do it in the framework of a particular linguistic phenomenon. The interface also makes it easy to compare against historical American English, the BNC, and Google Books/N-gram (though I won’t be showing that here). But if you get nothing out of this post but USE COCA, that’ll be enough.

Wh-exclamatives

A few months ago Anna Chernilovskaya came by and presented work she and Rick Nouwen had been doing on exclamatives like:

(1) What a beautiful song John wrote!

Their work is set against Rett (2011), which sees a wh-exclamative as a speaker expressing that something is noteworthy in the given context–in other words, to say (1) you have to think there’s something noteworthy about John’s song relative to the standard beauty of songs.

If you don’t have a degree adjective like “beautiful” (instead something like What a song John wrote!), Rett says you have an operator that acts like a silent adjective, so you’re exclaiming about beauty, weirdness, or complexity. Chernilovskaya and Nouwen are saying, “Nah, it’s simpler–it’s just direct noteworthiness. Drop all this degree stuff.”

My question: how are wh-exclamatives actually used by English speakers? My intuition is with C&N that it’s just about noteworthiness. But is that all?

First steps with COCA

Go to http://corpus.byu.edu/coca/, look over to the upper right–you can log in or register, as appropriate.

Most of the action is going to be in the panel on the left. Some of the stuff is hidden to reduce complexity, so if you want to see part-of-speech tags, just click on the text that says “POS LIST” and you’ll get a drop-down menu you can choose stuff from. If you want to do collocation stuff, just click “COLLOCATES”, etc.

COCA's left-pane

Let’s go ahead and start collecting examples of wh-exclamatives. In the “WORD(S)” text box, we could type:

what  [at*]

And that would get us 82,074 sentences, including both what a and what an. It would also get us what the, what every, and what no, including sentences like Her would-be opponents are pondering these questions and what the answers mean for their own possible candidacies.

That’s rather too much. Let’s try the following–the [y*] means “any punctuation”.

[y*] what [at*]

Now we have 18,063 sentences. This may help us see that I mean, what the heck is interesting, but perhaps it’s still taking us too far afield. Let’s just do what you are probably thinking we should’ve done from the beginning:

[y*] what a|an

8,919 results. Looking through the results, this is a pretty good query. We could restrict which punctuation we care about, but if you go ahead and do this query yourself, I think you’ll see why we want to keep most punctuation.
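If you want to try the same narrowing on a plain-text file of your own, here is a minimal sketch in Python. It is only a rough stand-in for the `[y*] what a|an` query: the regular expression, the set of punctuation marks, and the `find_what_a` helper are my own assumptions, not anything COCA exposes, and there is no part-of-speech tagging involved.

```python
import re

# Rough stand-in for the COCA query "[y*] what a|an": a punctuation mark,
# optional whitespace, then "what" followed by "a" or "an" as whole words.
# The pattern is an illustrative assumption, not COCA's own query engine.
WHAT_A = re.compile(r"""([.,;:!?"'()-])\s*(what\s+an?)\b""", re.IGNORECASE)

def find_what_a(text):
    """Return (punctuation, matched phrase) pairs found in text."""
    return [(m.group(1), m.group(2).lower()) for m in WHAT_A.finditer(text)]

sample = (
    "Thank you, David. What a story. "
    "They wondered what the answers mean. "
    "I mean, what an odd day!"
)
hits = find_what_a(sample)
print(hits)
```

The `what the answers` sentence is skipped because the pattern insists on `a` or `an` as a whole word, which is the same narrowing the query above does.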

Search results in COCA

The top right box shows all the matches lumped together by punctuation/article; click on any of them and the actual sentences will show up below. Notice the drop-down Help box on the far right above the sentences. Use that to find out more about query syntax.

Noteworthiness

Let’s see what the most common words are that go with a what a|an construction. We have a few options. We can run queries like:

what a|an [j*] [n*]
what a|an [j*]
what a|an [n*]

This third one is pretty interesting because it helps us see which sorts of exclamations are made without an adjective. (One person in our semantics/pragmatics group confided that his father always said What a baby! when confronted by an unattractive child–there’s a pressure to exclaim something but he doesn’t want to lie, so this strategy manages to give the right form without quite committing to the meaning that the parents might take away.)

Here are the top 15, though the first one ends up not really being part of our pattern a lot of the time.

1  WHAT A LOT 507
2  WHAT A DIFFERENCE 321
3  WHAT A WASTE 187
4  WHAT A MAN 183
5  WHAT A SHAME 168
6  WHAT A MESS 159
7  WHAT A PERSON 156
8  WHAT A RELIEF 152
9  WHAT A DAY 150
10  WHAT A SURPRISE 142
11  WHAT A WOMAN 126
12  WHAT A PLEASURE 117
13  WHAT A WAY 104
14  WHAT A STORY 101
15  WHAT A PITY 95

And in case you were curious about the one-word exclamatives with actual exclamation marks:

what a|an [n*] !
1  WHAT A MESS ! 30
2  WHAT A SURPRISE ! 27
3  WHAT A RELIEF ! 26
4  WHAT A SIGHT ! 23
5  WHAT A DAY ! 22
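If you have your own part-of-speech-tagged text, frequency lists like the ones above can be approximated offline. The sketch below assumes a toy, hand-tagged data set and a simplified tag scheme that I made up for illustration (it is not COCA's tagset): a slot in the pattern is either a set of word forms or a tag prefix, mirroring how `[n*]` matches any tag starting with `n`.

```python
from collections import Counter

# Hand-tagged toy sentences as (word, tag) pairs. The tags are simplified,
# hypothetical stand-ins: "j" = adjective, "n" = noun, "at" = article,
# "y" = punctuation, "x" = anything else.
TAGGED = [
    [("what", "x"), ("a", "at"), ("mess", "n"), ("!", "y")],
    [("what", "x"), ("a", "at"), ("great", "j"), ("story", "n"), (".", "y")],
    [("what", "x"), ("a", "at"), ("mess", "n"), (".", "y")],
    [("what", "x"), ("an", "at"), ("idea", "n"), ("!", "y")],
]

def matches(token, slot):
    """A slot is either a set of word forms or a tag prefix such as 'n'."""
    word, tag = token
    if isinstance(slot, set):
        return word.lower() in slot
    return tag.startswith(slot)

def count_pattern(sentences, pattern):
    """Count every token sequence that fits the pattern, slot by slot."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - len(pattern) + 1):
            window = sent[i:i + len(pattern)]
            if all(matches(tok, slot) for tok, slot in zip(window, pattern)):
                counts[" ".join(w.upper() for w, _ in window)] += 1
    return counts

# Analogue of the query: what a|an [n*]
noun_counts = count_pattern(TAGGED, [{"what"}, {"a", "an"}, "n"])
# Analogue of the query: what a|an [n*] !
exclaimed = count_pattern(TAGGED, [{"what"}, {"a", "an"}, "n", {"!"}])
print(noun_counts.most_common())
print(exclaimed.most_common())
```

Note that `what a great story` is not counted by the noun-only pattern, since the slot right after the article has to be a noun.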

One of the things C&N say Rett can’t handle is something like What an extremely nice man since the extremely and nice should interfere on Rett’s account. You can’t say *John is more extremely nice than Bill or *John is too extremely nice. How does this pattern work in the data?

what a|an [r*]
1  WHAT A VERY 24
2  WHAT A TRULY 13
3  WHAT A REALLY 12
4  WHAT A GREAT 10
5  WHAT AN ABSOLUTELY 9
6  WHAT AN INCREDIBLY 9
7  WHAT A WONDERFULLY 7

Now another way of looking at stuff is to look for collocational strength.

Getting collocation info

The important stuff here is that I clicked on “COLLOCATES” and put in the part of speech (adverb=[r*]) and chose the window I was looking at–in this case two to the right. I also adjusted the “MINIMUM” to be based on mutual information and I set it to ignore things with a mutual information of less than 2 (a standard strength measure is 3.0, but I wanted to get a few more than that).
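For anyone who wants to see what is going on under the hood, here is a hedged sketch of the two ingredients: counting collocates in a window to the right of a node phrase, and scoring a pair with pointwise mutual information. This is the textbook definition, log2(P(x,y) / (P(x)P(y))). COCA's own calculation may differ in its details (for instance in how it accounts for the window size), and the function names and toy sentence are mine.

```python
import math
from collections import Counter

def pmi(pair_count, node_count, collocate_count, corpus_size):
    """Pointwise mutual information in bits: log2(P(x,y) / (P(x) * P(y)))."""
    p_pair = pair_count / corpus_size
    p_node = node_count / corpus_size
    p_coll = collocate_count / corpus_size
    return math.log2(p_pair / (p_node * p_coll))

def right_collocates(tokens, node, window):
    """Count words appearing within `window` tokens to the right of `node`."""
    counts = Counter()
    n = len(node)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == node:
            counts.update(tokens[i + n:i + n + window])
    return counts

# Toy data, invented for illustration only.
tokens = "what a truly great day and what a day it was truly".split()
colls = right_collocates(tokens, ["what", "a"], window=2)
print(colls)

# Hypothetical counts: pair seen 8 times, node 16, collocate 16, in 1,024 words.
score = pmi(8, 16, 16, 1024)
print(score)
```

A minimum-MI setting like the one described above then amounts to keeping only the collocates whose score clears the chosen threshold.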

A few other things:

  • You may want to restrict yourself to just SPOKEN stuff (that’s in the middle of the left-pane).
  • If you have a big query you probably want to change # HITS FREQ to something big (the default is 100).
  • Often it’s more useful to GROUP BY lemmas than words (here it doesn’t matter, but think about what would happen if I were looking at verbs).
  • If you choose SAVE LISTS, you’ll get prompted to enter a list name ABOVE the top results. It’s really easy to miss.

But back to the results. The adverbs with the highest mutual information are truly, incredibly, wonderfully, extraordinarily, and remarkably, though the absolute counts are pretty low. Still, clicking around on examples may help.

Now if we do adjectives, we get these results:

Num  Adj          CtTogether    AdjCt  Perc    MI
1    [GREAT]             428  248,858  0.17   5.4
2    [WONDERFUL]         263   29,277  0.9   7.78
3    [BEAUTIFUL]         161   43,750  0.37   6.5
4    [GOOD]              133  409,451  0.03  2.99
5    [LOVELY]            102   10,246  1     7.93
6    [NICE]              100   50,448  0.2    5.6
7    [TERRIBLE]           84   20,290  0.41  6.67
8    [STRANGE]            78   26,432  0.3   6.18
9    [AMAZING]            60   17,204  0.35  6.42
10   [STUPID]             47   13,524  0.35  6.41

Notice how exclamatives skew positive. (That’s why the What a baby! trick works!)
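As a sanity check on how to read the adjective table, the Perc column appears to be just CtTogether divided by AdjCt, times 100. And if MI has the usual log2 form (an assumption on my part), the node frequency and corpus size cancel when you compare two rows, so the difference in MI between two adjectives should track log2 of the ratio of their Perc values. Here is a small sketch using four rows copied from the table. The helper names are mine.

```python
import math

# (adjective, count near "what a|an", total count, reported MI),
# copied from the adjective table.
ROWS = [
    ("GREAT", 428, 248_858, 5.4),
    ("WONDERFUL", 263, 29_277, 7.78),
    ("LOVELY", 102, 10_246, 7.93),
    ("GOOD", 133, 409_451, 2.99),
]

def perc(together, total):
    """Share of the adjective's occurrences that fall in the window, in percent."""
    return 100 * together / total

def predicted_mi_gap(row, base):
    """MI difference implied by the two rows' Perc values, assuming a log2 MI."""
    return math.log2(perc(row[1], row[2]) / perc(base[1], base[2]))

base = ROWS[0]
for row in ROWS[1:]:
    reported_gap = row[3] - base[3]
    print(f"{row[0]}: Perc={perc(row[1], row[2]):.2f}, "
          f"MI gap vs {base[0]}: reported {reported_gap:+.2f}, "
          f"predicted {predicted_mi_gap(row, base):+.2f}")
```

The practical upshot is that MI rewards adjectives like lovely, which are rare overall but show up after what a|an a lot, over adjectives like good, which are frequent everywhere.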

And nouns, though let’s increase the window to 4 to the right.

Num  Noun          CtTogether   NounCt  Perc    MI
1    [THING]              264  438,956  0.06  2.88
2    [DAY]                242  486,452  0.05  2.61
3    [IDEA]               209  133,349  0.16  4.27
4    [WAY]                198  521,448  0.04  2.22
5    [DIFFERENCE]         195   89,269  0.22  4.74
6    [WASTE]              166   31,419  0.53  6.02
7    [SURPRISE]           165   35,267  0.47  5.84
8    [MAN]                153  460,880  0.03  2.03
9    [SHAME]              151    9,431  1.6   7.62
10   [STORY]              148  178,875  0.08  3.34

Noteworthiness?

Here is C&N’s definition of noteworthiness:

an entity is noteworthy iff its intrinsic characteristics (i.e. those char-
acteristics that are independent of the factual situation) stand out con-
siderably with respect to a comparison class of entities (C&N 2012: 5).

In the last section, we saw that what a (adj) story was a prominent use (#10). If we restrict ourselves just to the spoken portion of the corpus, this leaps up to #1. That’s because the spoken portion comes from talk shows and news programs (like Good Morning America, Dateline, and Larry King). If you look at the transcripts (or if you have ever listened to American news), you’ll see that what a (great/emotional/amazing/astonishing/inspiring) story usually comes up after the story is done and a segue is happening. The same is true of what a pleasure and many of the other items. What a is used in these talk show/news programs as a way of simultaneously evaluating and moving between topics (usually out of, but also sometimes into).

All of this makes me skeptical about the definition C&N provide.

Consider this Good Morning America clip from last August (fast forward to about 56:20), where two back-to-back stories are described in terms of “what a” noteworthiness:

{Story about a woman surviving in the wilderness for 3 days}

Thank you, David.

What a story.

And what a story we have coming up for you.

{Uh, that’s about the making of a boy band.}

I would contend that these stories are not really that noteworthy (they also occur at the tail end of the show and so may be the most cuttable if things earlier had gone long). You may or may not agree with me. But at a minimum we probably need to say that such exclamatives are claims about noteworthiness, not factual observations about things that are intrinsically noteworthy. Any sort of judgment about noteworthiness has to have a judge, so that seems to be a problem for arguments about intrinsic qualities.

Part of Rett’s discussion does have the speaker in the mix (pages 4-5), but then towards the end of the paper she says of

(2) How very unexpected John’s news is!

(3) What a surprise John’s height is!

“To the extent that they sound natural, are interpreted as reflecting an objective surprise or unexpectedness rather than one oriented to the speaker” (Rett 2011: 19). Her main point is that gradable properties get their values from context–it’s not that they reflect the speaker’s attitude.

I’m a big fan of information theoretic accounts of language, which give measures of surprise. The surprise of “x” and “y” co-occurring is based on prior probabilities of them occurring separately and together. But the truth is that they are always defined against some perceiver’s experience. Psycholinguists use corpora to estimate how surprising word y is following word x, but if some subject had a remarkably different experience with “x” and “y” than most of us, well, we’d expect effects to be different.
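To make the perceiver-dependence concrete, here is a toy sketch of bigram surprisal, -log2 P(y | x), estimated from raw counts. The two "experiences" are invented strings standing in for two hypothetical listeners, and the function name is mine. Real psycholinguistic estimates use much larger corpora and smoothing.

```python
import math

def bigram_surprisal(tokens, x, y):
    """Surprisal of y following x, in bits: -log2 P(y | x), from raw counts."""
    pair_count = sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (x, y))
    x_count = sum(1 for a in tokens[:-1] if a == x)
    if pair_count == 0:
        return math.inf  # never observed: unbounded surprise without smoothing
    return -math.log2(pair_count / x_count)

# Two hypothetical perceivers with different linguistic experience.
news_viewer = "what a story what a story what a day what a story".split()
other_reader = "what a day what a day what a day what a story".split()

viewer_bits = bigram_surprisal(news_viewer, "a", "story")
reader_bits = bigram_surprisal(other_reader, "a", "story")
print(viewer_bits, reader_bits)
```

The same word pair gets a lower surprisal for the listener who has heard it more often, which is the sense in which the measure is always relative to somebody's experience.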

After looking through the actual uses of what a, I propose you HAVE to build in the speaker. And what is more, these what a sentences really are doing more than just expressing an observation of the world. More than expressing an internal state. And more than just an evaluation. They are social in their nature (what a (stupid/amazing/dumb) thing to say), so I would contend that theories should also look at consequences of the use in terms of the relationship between the speaker and their audience. My inclination is also to believe that we ought to say something about how they tend to skew positively in terms of adjectival collocates and probably how they hold that positive-skew as a default interpretation even when there’s no adjective, as in the what a baby! example.

But what a long post this is. I’ll stop.
