Vocabulary richness

21 Nov

Today’s topic comes from the Corpora list (join or search it here; really, it’s full of stuff).

One recent discussion is about “TTR”, which is an old-school way of measuring the lexical diversity of a text. The abbreviation stands for “type-token ratio”: you count the unique word types in a text and divide that by the total number of tokens.
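
Here is a minimal sketch of that calculation in Python. The `tokenize` helper and its regular expression are my own simplifying assumptions (lowercased runs of letters and apostrophes), not anything from the thread; a real study would want a proper tokenizer.

```python
import re


def tokenize(text):
    """Crude tokenizer (an assumption for illustration): lowercase the text
    and keep runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of unique word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


tokens = tokenize("The cat sat on the mat and the dog sat too")
# 8 distinct types over 11 tokens
print(len(set(tokens)), len(tokens), type_token_ratio(tokens))
```

Note that what counts as a "type" depends entirely on the tokenizer: here "The" and "the" collapse into one type because of the lowercasing.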

That’s pretty easy to calculate, but as people on the list point out, what the hell are you going to use it for? Let’s say you want to compare some novels, or you want to compare some transcribed speech from kids you’re worried about. The TTR is going to be really dependent on how much data you have: the longer a text gets, the more it repeats words it has already used, so the ratio drops. So if you want any sort of stats, you need to have equal-size text samples. (So you’d sample from each text as many tokens as your smallest text has.)
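
A rough sketch of that equal-size sampling idea in Python. The function name `equal_size_ttr`, the choice of random contiguous windows, and the averaging over `n_samples` windows are all my own assumptions for illustration; the thread does not prescribe a particular sampling scheme.

```python
import random


def ttr(tokens):
    """Type-token ratio: unique types divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def equal_size_ttr(texts, n_samples=100, seed=0):
    """texts maps a name to a list of tokens.

    Every text is sampled at the length of the shortest text, using random
    contiguous windows (an assumed scheme), and the TTRs are averaged.
    """
    rng = random.Random(seed)
    size = min(len(tokens) for tokens in texts.values())
    if size == 0:
        raise ValueError("every text needs at least one token")
    results = {}
    for name, tokens in texts.items():
        scores = []
        for _ in range(n_samples):
            start = rng.randrange(len(tokens) - size + 1)
            scores.append(ttr(tokens[start:start + size]))
        results[name] = sum(scores) / len(scores)
    return results


# Toy data: the "long" text is just the short one repeated five times.
short = "a b c d".split()
long_text = short * 5
print(ttr(short), ttr(long_text))  # raw TTR falls as the text gets longer
print(equal_size_ttr({"short": short, "long": long_text}))
```

In this toy case the raw TTRs differ only because of length, and the equal-size version gives both texts the same score. Contiguous windows keep the local repetition patterns of the text; sampling individual tokens at random would be a different (also defensible) choice.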

As the thread points out, you probably want to check out Tweedie and Baayen (1998), “How variable may a constant be? Measures of lexical richness in perspective”.

But in terms of actual implementation of TTR and alternative measures, I would steer you to chapter 6.5 of Baayen’s introduction to linguistic analysis in R.

(Also see Benjamin Allison’s post for some thoughts about how to measure vocabulary richness. And David Hoover’s 2003 work on vocabulary richness measures.)

