The Corpora mailing list (join or search it here; it's full of good stuff).
One recent discussion is about "TTR", an old-school way of measuring the lexical diversity of a text. The abbreviation stands for "type-token ratio": you count the number of unique word types in a text and divide that by the total number of tokens.
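The calculation really is that simple. Here's a minimal sketch in Python (my own illustration, not from the list discussion), treating whitespace-split lowercase words as tokens:

```python
def ttr(tokens):
    """Type-token ratio: number of unique types / number of tokens."""
    return len(set(tokens)) / len(tokens)

text = "the cat sat on the mat and the dog sat too"
tokens = text.lower().split()
# 8 unique types ("the", "cat", "sat", "on", "mat", "and", "dog", "too")
# out of 11 tokens
print(ttr(tokens))  # → 0.7272727272727273
```

Real corpora would of course need proper tokenization (punctuation, casing, clitics), but the ratio itself is just this.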
That’s pretty easy to calculate, but as people on the list point out, what the hell are you going to use it for? Say you want to compare some novels, or compare transcribed speech from kids you’re worried about. TTR is heavily dependent on how much data you have: longer texts inevitably repeat more words, so their ratios sink. So if you want any sort of stats, you need equal-sized text samples. (That is, you’d sample each text down to the token count of your smallest text.)
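One hedged sketch of that equal-size sampling idea (the function names here are my own, hypothetical ones): sample every text down to the length of the shortest before comparing ratios.

```python
import random

def ttr(tokens):
    """Type-token ratio: unique types / total tokens."""
    return len(set(tokens)) / len(tokens)

def sampled_ttrs(texts, seed=0):
    """Downsample each text to the length of the shortest,
    then compute TTR on the equal-sized samples."""
    rng = random.Random(seed)
    n = min(len(t) for t in texts)
    return [ttr(rng.sample(t, n)) for t in texts]

# Two texts drawn from the same small vocabulary: the raw TTR of the
# longer one is far lower purely because of its length.
rng = random.Random(1)
vocab = ["alpha", "beta", "gamma", "delta"]
short = [rng.choice(vocab) for _ in range(20)]
long_ = [rng.choice(vocab) for _ in range(2000)]
print(ttr(short), ttr(long_))        # raw ratios: wildly unequal
print(sampled_ttrs([short, long_]))  # equal-sized samples: comparable
```

A single random sample is still noisy, which is part of why the literature moved on to fancier measures; averaging over many samples (or the measures below) is the usual fix.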
As the thread points out, you probably want to check out Tweedie and Baayen (1998), “How variable may a constant be? Measures of lexical richness in perspective”.
But for actual implementations of TTR and alternative measures, I’d steer you to Chapter 6.5 of Baayen’s introduction to linguistic analysis in R.