COCA: What a fantastic source of data!

Intro

425 million words from 1990-2011.

I believe that one of the best resources out there for linguists (or anyone interested in language) is the Corpus of Contemporary American English (COCA). Mark Davies has assembled a bunch of corpora and built an easy-to-use interface so you can make sophisticated queries on vast amounts of data. It’s a lot bigger than most of the corpora you may be using now (CELEX, Switchboard, etc.). And while its sentences aren’t annotated with tree structures, it does have part-of-speech info (and makes it really easy to get collocates).

This post is really about getting started with COCA, but I’ll try to do it in the framework of a particular linguistic phenomenon. If you get nothing out of this post except USE COCA, that’ll be enough. The interface also makes it easy to compare against historical American English, the BNC, and Google Books/N-gram data (though I won’t be showing that here).

Wh-exclamatives

A few months ago Anna Chernilovskaya came by and presented work she and Rick Nouwen had been doing on exclamatives like:

(1) What a beautiful song John wrote!

Their work is set against Rett (2011), which treats a wh-exclamative as the speaker expressing that something is noteworthy in the given context. In other words, to say (1) you think there’s something noteworthy about John’s song relative to the standard beauty of songs.

If you don’t have a degree adjective like “beautiful” (instead something like What a song John wrote!), Rett says you have an operator that acts like a silent adjective, so you’re exclaiming about beauty, weirdness, or complexity. Chernilovskaya and Nouwen are saying, “Nah, it’s simpler: it’s just direct noteworthiness. Drop all this degree stuff.”

My question: how are wh-exclamatives actually used by English speakers? My intuition is with C&N that it’s just about noteworthiness. But is that all?

First steps with COCA

Go to http://corpus.byu.edu/coca/ and look over to the upper right, where you can log in or register, as appropriate.

Most of the action is going to be in the panel on the left. Some of the stuff is hidden to reduce complexity, so if you want to see part-of-speech tags, just click on the text that says “POS LIST” and you’ll get a drop-down menu you can choose stuff from. If you want to do collocation stuff, just click “COLLOCATES”, etc.

[Screenshot: COCA’s left pane]

Let’s go ahead and start collecting examples of wh-exclamatives. In the “WORD(S)” text box, we could type:

what [at*]

And that would get us 82,074 sentences, including both what a and what an. It would also get us what the, what every, and what no, including sentences like “Her would-be opponents are pondering these questions and what the answers mean for their own possible candidacies.”

That’s rather too much. Let’s try the following, where [y*] means “any punctuation”.

[y*] what [at*]

Now we have 18,063 sentences. This may help us see that “I mean, what the heck” is interesting, but perhaps it’s still taking us too far afield. Let’s just do what you are probably thinking we should’ve done from the beginning:

[y*] what a|an

8,919 results. Looking through the results, this is a pretty good query. We could restrict which punctuation we care about, but if you go ahead and do this query yourself, I think you’ll see why we want to keep most punctuation.

[Screenshot: search results in COCA]

The top right box shows all the matches lumped together by punctuation/article. Click on any of them and the actual sentences will show up below. Notice the drop-down Help box on the far right above the sentences. Use that to find out more about query syntax.
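
If you have plain text of your own and want to try the same idea offline, here is a minimal sketch of what the [y*] what a|an query is getting at. To be clear about the assumptions: this is not COCA’s query engine or an API for it, just a Python regular expression that treats a handful of punctuation marks (or the start of the text) as a stand-in for [y*]. The function name and the sample sentences are invented for illustration.

```python
import re

# Rough stand-in for "[y*] what a|an": a punctuation mark (or the very start
# of the text), then "what", then "a" or "an" as a whole word.
# Assumption: this short punctuation list approximates COCA's [y*] tag.
PATTERN = re.compile(r"(?:^|[.,;:!?()-])\s*(what)\s+(an?)\b", re.IGNORECASE)

def find_what_a(text):
    """Return a (what, article) pair for each punctuation-initial 'what a|an'."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in PATTERN.finditer(text)]

# Invented sentences, not corpus data.
sample = (
    "Thank you, David. What a story. "
    "I know what a corpus is. "
    "She laughed: what an odd thing to say!"
)

hits = find_what_a(sample)
print(hits)
```

The embedded-question case (I know what a corpus is) is skipped because no punctuation comes right before what, which is the same reason the punctuation restriction cleans up the COCA results.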

Noteworthiness

Let’s see what the most common words are that go with a what a|an construction. We have a few options. We can run queries like:

what a|an [j*] [n*]

what a|an [j*]

what a|an [n*]

This third one is pretty interesting because it helps us see which sorts of exclamations are made without an adjective. (One person in our semantics/pragmatics group confided that his father always said What a baby! when confronted by an unattractive child. There’s a pressure to exclaim something but he doesn’t want to lie, so this strategy manages to give the right form without quite meaning what the parents might take away.)

Here are the top 15, though the first one ends up not really being part of our pattern a lot of the time.

1	WHAT A LOT	507
2	WHAT A DIFFERENCE	321
3	WHAT A WASTE	187
4	WHAT A MAN	183
5	WHAT A SHAME	168
6	WHAT A MESS	159
7	WHAT A PERSON	156
8	WHAT A RELIEF	152
9	WHAT A DAY	150
10	WHAT A SURPRISE	142
11	WHAT A WOMAN	126
12	WHAT A PLEASURE	117
13	WHAT A WAY	104
14	WHAT A STORY	101
15	WHAT A PITY	95

And in case you were curious about the one-word exclamatives with actual exclamation marks:

what a|an [n*] !

1	WHAT A MESS !	30
2	WHAT A SURPRISE !	27
3	WHAT A RELIEF !	26
4	WHAT A SIGHT !	23
5	WHAT A DAY !	22
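
For anyone who wants to build a frequency list like this from their own text, here is a rough sketch. It assumes you have no part-of-speech tagger on hand, so instead of [n*] it simply requires exactly one word between the article and the exclamation mark. That is only an approximation of the COCA query, and the toy sentences are made up.

```python
import re
from collections import Counter

# Approximates "what a|an [n*] !" without a tagger: exactly one word may sit
# between the article and the "!" (COCA tokenizes "!" separately, so allow a space).
ONE_WORD = re.compile(r"\bwhat\s+an?\s+(\w+)\s*!", re.IGNORECASE)

def tally_exclamatives(sentences):
    """Count the single word that follows 'what a|an' in one-word exclamatives."""
    counts = Counter()
    for sentence in sentences:
        for match in ONE_WORD.finditer(sentence):
            counts[match.group(1).upper()] += 1
    return counts

# Invented examples, not corpus data.
toy = [
    "What a mess!",
    "Oh, what a mess !",
    "What a day!",
    "What a beautiful day!",   # two words before "!", so it is not counted
    "What an idea!",
]

counts = tally_exclamatives(toy)
for word, n in counts.most_common():
    print(f"WHAT A(N) {word} !\t{n}")
```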

One of the things C&N say Rett can’t handle is something like What an extremely nice man since the extremely and nice should interfere on Rett’s account. You can’t say *John is more extremely nice than Bill or *John is too extremely nice. How does this pattern work in the data?

what a|an [r*]

1	WHAT A VERY	24
2	WHAT A TRULY	13
3	WHAT A REALLY	12
4	WHAT A GREAT	10
5	WHAT AN ABSOLUTELY	9
6	WHAT AN INCREDIBLY	9
7	WHAT A WONDERFULLY	7

Now another way of looking at stuff is to look for collocational strength.

[Screenshot: getting collocation info]

The important stuff here is that I clicked on “COLLOCATES”, put in the part of speech (adverb=[r*]), and chose the window I was looking at, in this case two to the right. I also adjusted the “MINIMUM” to be based on mutual information and set it to ignore anything with a mutual information of less than 2 (a standard cutoff is 3.0, but I wanted to get a few more results than that).

A few other things:

You may want to restrict yourself to just SPOKEN stuff (that’s in the middle of the left-pane).
If you have a big query you probably want to change # HITS FREQ to something big (the default is 100).
Often it’s more useful to GROUP BY lemmas than words (here it doesn’t matter, but think about what would happen if I were doing something with verbs).
If you choose SAVE LISTS, you’ll get prompted to enter a list name ABOVE the top results. It’s really easy to miss.

But back to the results. The adverbs with the highest mutual information are truly, incredibly, wonderfully, extraordinarily, and remarkably, though the absolute counts are pretty low. Still, clicking around on examples may help.

Now if we do adjectives, we get these results:

Num	Adj	CtTogether	AdjCt	Perc	MI
1	[GREAT]	428	248,858	0.17	5.4
2	[WONDERFUL]	263	29,277	0.9	7.78
3	[BEAUTIFUL]	161	43,750	0.37	6.5
4	[GOOD]	133	409,451	0.03	2.99
5	[LOVELY]	102	10,246	1	7.93
6	[NICE]	100	50,448	0.2	5.6
7	[TERRIBLE]	84	20,290	0.41	6.67
8	[STRANGE]	78	26,432	0.3	6.18
9	[AMAZING]	60	17,204	0.35	6.42
10	[STUPID]	47	13,524	0.35	6.41

Notice how exclamatives skew positive. (That’s why the What a baby! trick works!)

And nouns, though let’s increase the window to 4 to the right.

Num	Noun	CtTogether	NounCt	Perc	MI
1	[THING]	264	438,956	0.06	2.88
2	[DAY]	242	486,452	0.05	2.61
3	[IDEA]	209	133,349	0.16	4.27
4	[WAY]	198	521,448	0.04	2.22
5	[DIFFERENCE]	195	89,269	0.22	4.74
6	[WASTE]	166	31,419	0.53	6.02
7	[SURPRISE]	165	35,267	0.47	5.84
8	[MAN]	153	460,880	0.03	2.03
9	[SHAME]	151	9,431	1.6	7.62
10	[STORY]	148	178,875	0.08	3.34
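
If the Perc and MI columns in these tables look mysterious, here is a small sketch of how numbers like them can be computed. Perc is just the share of a word’s occurrences that fall inside the window, as a percentage. For MI I’m assuming the usual pointwise mutual information in bits, with the expected count scaled by the window size; COCA’s exact computation may differ. The function names are mine, and the numbers in the MI example are round toy values rather than corpus counts (the post doesn’t report the node frequency used in the collocates search).

```python
import math

def percent_with_node(joint, collocate_freq):
    """Perc column: percentage of the collocate's occurrences found near the node."""
    return 100.0 * joint / collocate_freq

def mutual_information(joint, node_freq, collocate_freq, corpus_size, span=1):
    """Pointwise MI in bits: log2(observed / expected co-occurrences).

    Assumption: the expected count is scaled by the window size (span),
    since a wider window gives more chances to co-occur by accident.
    """
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(joint / expected)

# Perc for two rows of the adjective table above (CtTogether, AdjCt).
great_perc = round(percent_with_node(428, 248_858), 2)
wonderful_perc = round(percent_with_node(263, 29_277), 2)
print(great_perc, wonderful_perc)

# Toy MI example with invented round numbers: 16 co-occurrences where
# chance alone predicts 4, using a window of 2 words.
toy_mi = mutual_information(joint=16, node_freq=1_000, collocate_freq=2_000,
                            corpus_size=1_000_000, span=2)
print(toy_mi)
```

In the toy case the pair shows up four times more often than chance predicts, which works out to an MI of 2 bits, right at the cutoff I used in the collocates search.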

Noteworthiness?

Here is C&N’s definition of noteworthiness:

an entity is noteworthy iff its intrinsic characteristics (i.e. those characteristics that are independent of the factual situation) stand out considerably with respect to a comparison class of entities (C&N 2012: 5).

In the last section, we saw that what a (adj) story was a prominent use (#10 among the noun collocates). If we restrict ourselves to just the spoken portion of the corpus, this leaps up to #1. That’s because the spoken portion comes from talk shows and news programs (like Good Morning America, Dateline, and Larry King). If you look at the transcripts, or if you have ever listened to American news, you’ll know that what a (great/emotional/amazing/astonishing/inspiring) story usually comes up after the story is done and a segue is happening. The same is true of what a pleasure and many of the other items. What a is used in these talk show/news programs as a way of simultaneously evaluating and moving between topics (usually out of, but also sometimes into).

All of this makes me skeptical about the definition C&N provide.

Consider this Good Morning America clip from last August (fast forward to about 56:20), where two stories, back-to-back, are described in terms of “what a” noteworthiness:

{Story about a woman surviving in the wilderness for 3 days}

Thank you, David.

What a story.

And what a story we have coming up for you.

{Uh, that’s about the making of a boy band.}

I would contend that these stories are not really that noteworthy (they also occur at the tail end of the show and so may be the most cuttable if things earlier had gone long). You may or may not agree with me. But at a minimum we probably need to say that such exclamatives are claims about noteworthiness, not factual observations about things that are intrinsically noteworthy. Any sort of judgment about noteworthiness has to have a judge, so that seems to be a problem for arguments about intrinsic qualities.

Part of Rett’s discussion does have the speaker in the mix (pages 4-5), but then towards the end of the paper she says of

(2) How very unexpected John’s news is!

(3) What a surprise John’s height is!

“To the extent that they sound natural, are interpreted as reflecting an objective surprise or unexpectedness rather than one oriented to the speaker” (Rett 2011: 19). Her main point is that gradable properties get their values from context–it’s not that they reflect the speaker’s attitude.

I’m a big fan of information-theoretic accounts of language, which give measures of surprise. The surprise of “x” and “y” co-occurring is based on prior probabilities of them occurring separately and together. But the truth is that they are always defined against some perceiver’s experience. Psycholinguists use corpora to estimate how surprising word y is following word x, but if some subject had a remarkably different experience with “x” and “y” than most of us, well, we’d expect effects to be different.
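
To make the perceiver point concrete, here is a tiny sketch of surprisal, the standard information-theoretic measure of how unexpected y is after x: the negative log (base 2) of P(y | x). The two sets of bigram counts are invented and stand for two hypothetical listeners with different experience; there is no smoothing, so the pair you ask about has to occur at least once in the counts.

```python
import math
from collections import Counter

def surprisal(bigrams, x, y):
    """Surprisal in bits of y right after x: -log2 P(y | x), from raw bigram counts."""
    x_total = sum(n for (first, _), n in bigrams.items() if first == x)
    return -math.log2(bigrams[(x, y)] / x_total)

# Two invented "experiences" of what follows the word "what".
typical_listener = Counter({("what", "a"): 8, ("what", "the"): 8})
unusual_listener = Counter({("what", "a"): 1, ("what", "the"): 15})

typical_bits = surprisal(typical_listener, "what", "a")
unusual_bits = surprisal(unusual_listener, "what", "a")
print(typical_bits, unusual_bits)
```

Same words, same formula, but the second listener should find “a” after “what” considerably more surprising, because the measure only exists relative to the counts you feed it.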

After looking through the actual uses of what a, I propose you HAVE to build in the speaker. And what is more, these what a sentences really are doing more than just expressing an observation of the world. More than expressing an internal state. And more than just an evaluation. They are social in their nature (what a (stupid/amazing/dumb) thing to say), so I would contend that theories should also look at consequences of the use in terms of the relationship between the speaker and their audience. My inclination is also to believe that we ought to say something about how they tend to skew positively in terms of adjectival collocates and probably how they hold that positive-skew as a default interpretation even when there’s no adjective, as in the what a baby! example.

But what a long post this is. I’ll stop.

Tags: coca, emotion, english, exclamatives, fav, pragmatics, semantics, syntax, wh-words
