Brendan O’Connor & Co. from CMU have updated their tweet parser and provided a bunch of other stuff, including a collection of 56 million English-language tweets.
They’ve also done some clustering work on the words. Some of their clusters make a lot of sense immediately:
- haven’t havent shoulda would’ve should’ve hadn’t woulda could’ve coulda havnt shouldve wouldve must’ve musta couldve haven’t havn’t hadnt might’ve hvnt mustve shuda wuda shudda wudda shulda wulda mighta cudda have’nt wudve shudve hvent #glocalurban hadn’t haven`t mightve shlda haven´t culda should’ve wlda avnt would’ve hvn’t may’ve cudve shldve have’t could’ve
Others are intriguing. For example, I believe gaydar may be an actual body part, given its cluster:
- body brain soul skin stomach throat belly tummy ego imagination gut liver jaw spine bladder handwriting scalp body’s subconscious uterus complexion stomache eyesight navel torso palate bodys demeanor physique waistline clitoris abdomen spleen gaydar gallbladder pocketbook bdy bodyy tummy’s tailbone ringback ribcage cervix skinn throat’s escentuals skin’s sternum ellum cell’s
Btw, look at all the ways to put lol in the past tense!
- looked felt laughed yelled tasted screamed smiled smelled acted shouted stared waved lol’d smelt bitched giggled winked loled lookd behaves glanced chuckled honked barked moaned growled peeked blushed beeped lol’ed squealed gasped hollered cringed whistled whined glared lold grinned smirked hissed snored lolled holla’d lol-ed laffed meowed stuttered groaned flinched
The clusters in HTML: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html
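
If you'd rather poke at the clusters in code than in the viewer, here is a minimal Python sketch. It assumes the clusters are available as tab-separated lines of the form cluster ID, word, count; that layout is my assumption, so check it against whatever file you actually download. The helper names `load_clusters` and `cluster_of` are made up for illustration, and the sample rows (IDs and counts) are invented, not real data.

```python
from collections import defaultdict


def load_clusters(lines):
    """Group words by cluster ID.

    Assumes each line looks like 'cluster_id<TAB>word<TAB>count'
    (an assumed layout, not confirmed by the post).
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        cluster_id, word = parts[0], parts[1]
        clusters[cluster_id].append(word)
    return dict(clusters)


def cluster_of(clusters, word):
    """Return (cluster_id, words) for the cluster containing word, or None."""
    for cluster_id, words in clusters.items():
        if word in words:
            return cluster_id, words
    return None


# Invented sample rows: the IDs and counts are placeholders, not real data.
sample = [
    "0101\tlaughed\t120\n",
    "0101\tlol'd\t15\n",
    "0101\tloled\t7\n",
    "1100\tbody\t300\n",
    "1100\tgaydar\t4\n",
]

clusters = load_clusters(sample)
print(cluster_of(clusters, "gaydar"))
```

With a real file you would pass an open file handle instead of `sample`, then look up any word to see which neighbours it ended up with.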