- Dirty words, taboo words, swear words, etc. I’ve found these lists particularly helpful in understanding how people are using Twitter data (btw, there’s COPIOUS Twitter stuff on Infochimps).
- Speaking of bad words, you can also get all those Enron emails at Infochimps (but we have them at Stanford, too: /afs/ir/data/linguistic-data/Enron-Email-Corpus).
- And speaking of voyeurism, there’s also a corpus of erotica available. (There’s some pretty out-there stuff in it, but saying that may only make you more interested. Update: a quick analysis of this corpus.)
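If you want to try one of those word lists on Twitter-style text, here is a minimal sketch of the idea: tokenize each message and count how often the listed words show up. The function name `count_listed_words`, the two-word list, and the sample tweets are all made-up stand-ins (not anything from Infochimps), and the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def count_listed_words(texts, word_list):
    """Count how often each word from word_list appears across texts."""
    # Lowercase everything so "Darn" and "darn" count as the same word.
    targets = {w.lower() for w in word_list}
    counts = Counter()
    for text in texts:
        # Crude tokenizer: runs of letters and apostrophes only.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in targets:
                counts[token] += 1
    return counts


# Hypothetical stand-ins for a downloaded word list and a handful of tweets.
taboo_words = ["darn", "heck"]
tweets = [
    "Darn, missed the bus again",
    "what the heck is this",
    "heck yes, darn right",
]

counts = count_listed_words(tweets, taboo_words)
for word, n in counts.most_common():
    print(word, n)
```

With a real list you would read the words from the downloaded file instead of hard-coding them, and you would probably want a tokenizer that copes with hashtags, mentions, and creative spellings.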
Some of you eat with finger bowls and extended pinkies. Does Infochimps have less smutty stuff?
- One of the interests in the NLP group is understanding how academic collaboration works (for example, Johri et al., 2011). A similar question could be asked by looking at a million different syllabi.
- There’s lots of newsgroup stuff out there (besides porn), for example: here and here.
- And there are also summaries of headlines and news articles.
Finally, some stuff on accents:
- Infochimps HAS lots of data and they also LINK to lots of data. For example, the Speech Accent Archive, which has over 1,200 people reading a paragraph in English. They’re from all over the place (native and non-native English speakers).
- There’s also a meme on YouTube that you could use. Search for “What’s my accent?” (Here are some steps I wrote up about how to download YouTube audio so you can play with it in Praat.)
- And, okay, the whole reason you should’ve been reading this post is to get the following link to “Pronunciation Manual”. If you haven’t seen it yet: it’s honey-badger good. Happy Friday!