Tweet parser and word clusters

22 Sep

Brendan O’Connor & Co. from CMU have updated their tweet parser and provided a bunch of other stuff, including a collection of 56 million English-language tweets.

They’ve also done some clustering work on the words. Some of their clusters make a lot of sense immediately:

  • haven’t havent shoulda would’ve should’ve hadn’t woulda could’ve coulda havnt shouldve wouldve must’ve musta couldve haven’t havn’t hadnt might’ve hvnt mustve shuda wuda shudda wudda shulda wulda mighta cudda have’nt wudve shudve hvent #glocalurban hadn’t haven`t mightve shlda haven´t culda should’ve wlda avnt would’ve hvn’t may’ve cudve shldve have’t could’ve

Others are intriguing. For example, I believe gaydar may be an actual body part, given its cluster:

  • body brain soul skin stomach throat belly tummy ego imagination gut liver jaw spine bladder handwriting scalp body’s subconscious uterus complexion stomache eyesight navel torso palate bodys demeanor physique waistline clitoris abdomen spleen gaydar gallbladder pocketbook bdy bodyy tummy’s tailbone ringback ribcage cervix skinn throat’s escentuals skin’s sternum ellum cell’s

Btw, look at all the ways to put lol in the past tense!

  • looked felt laughed yelled tasted screamed smiled smelled acted shouted stared waved lol’d smelt bitched giggled winked loled lookd behaves glanced chuckled honked barked moaned growled peeked blushed beeped lol’ed squealed gasped hollered cringed whistled whined glared lold grinned smirked hissed snored lolled holla’d lol-ed laffed meowed stuttered groaned flinched
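
If you want to pull the lol variants out of a cluster like this one yourself, a quick regular expression gets you most of the way. This is only a rough sketch: the word list is a hand-copied sample from the cluster above, and both the pattern and the `lol_past_forms` helper are my own guesses at what counts as "lol in the past tense", not anything from the CMU release.

```python
import re

# A hand-copied sample of words from the cluster above (not the full list).
cluster = (
    "looked felt laughed lol’d giggled loled lookd lol’ed "
    "lold lolled holla’d lol-ed laffed"
).split()

# Assumed pattern: "lol" (possibly with extra l's), an optional apostrophe
# or hyphen, an optional "e", and a final "d".
LOL_PAST = re.compile(r"^lol+['’-]?e?d$")

def lol_past_forms(words):
    """Return the words that look like a past-tense spelling of 'lol'."""
    return [w for w in words if LOL_PAST.match(w)]

print(lol_past_forms(cluster))
```

On the sample it keeps lol’d, loled, lol’ed, lold, lolled and lol-ed, and skips lookd and holla’d. Note that lolled is also an ordinary English verb, so a pattern like this will over-match on other text.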

The clusters in HTML: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html


Geography stuff for dialectology

14 Mar

InfoChimps have a geography API that might help you plot people against locations:


Brice Russ at OSU has been doing Twitter dialect stuff using the Data Science Toolkit, but says the InfoChimps API looks more user-friendly. (That was only an initial look, though, so hit him up at @KilroyWasHere on Twitter to learn more.)