Archive | January, 2013

Hashtaghashtag: A summary of #LSA2013, #MLA13, and a bit about the twitteR package for R

Linguists had a big conference in Boston this past weekend and they got together to vote “#hashtag” as the 2012 Word of the Year. (My own Twitter pleas for “Honey Boo Boo” as WOTY went unheeded.)

Fun summary of the show-down between “#hashtag” and “marriage equality”

This post is a quick summary of what happened on Twitter during the big conferences of the Linguistic Society of America (LSA) and the Modern Language Association (MLA). At the bottom of the post, I also show how to grab this kind of data using the twitteR package for R.

For the data, I restrict myself to everything that was marked #lsa2013 or #mla13. (I’m doing a sleight of hand here–the WOTY vote belongs to the American Dialect Society, which runs its conference alongside the LSA’s.)

From Jan 3 to Jan 8, there were 872 tweets with #lsa2013 (I’m including RTs in this number–drop anything containing “RT” and you’re down to 584 tweets). After removing common words, here’s the word cloud.
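If you want to try the "drop anything with RT" step in R, here is a minimal sketch. The `tweets` vector is made up for illustration (it is not the real #lsa2013 data), and matching "RT" as a standalone, case-sensitive token is my assumption about what counts as a retweet marker.

```r
# Hypothetical stand-in for the scraped tweet text; not the real data.
tweets <- c(
  "Great plenary this morning #lsa2013",
  "RT @someone: Great plenary this morning #lsa2013",
  "Voting on the word of the year now #lsa2013 #woty12",
  "lol RT @another: hashtag wins #lsa2013"
)

# Flag any tweet containing "RT" as a whole word (assumption: this is a
# rough proxy for "this tweet involves a retweet").
is_rt <- grepl("\\bRT\\b", tweets, perl = TRUE)

n_all   <- length(tweets)  # count including retweets
n_no_rt <- sum(!is_rt)     # count after dropping anything marked RT
```

The word-boundary match keeps words like "START" from being counted, but it will still catch quoted or manual retweets in the middle of a tweet, which seems to be the intent here.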

Since linguists liked hashtags for WOTY–how many did they use? Holding aside “#lsa2013” (since it is by definition part of every tweet I’m looking at), there were 75 different hashtags, used a total of 305 times. The most popular (as you can see in a stripped form in the word cloud) were #ads2013, #woty12, #ads, and #woty2012; most of these were people tagging tweets about the WOTY vote with multiple tags. Also fairly popular: #mla13, the tag for the Modern Language Association’s annual meeting, which was happening at the same time, also in Boston.

Linguists did tweet about findings in presentations they were giving/watching, but not all that much, and there isn’t enough activity in any particular topic area/sub-discipline to puff anything up in the word cloud. Plenaries do get the most live tweeting, as you’d expect. If “variation” or “rickford” catch your eye, see my summary of NWAV 41 from November, which is the big sociolinguistics conference in the US.
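Counting hashtags like this can be sketched in a few lines of base R. The tweets below are invented, and the regular expression (a "#" followed by letters, digits, or underscores) is my simplification of what Twitter treats as a hashtag.

```r
# Hypothetical tweets, for illustration only.
tweets <- c(
  "Great plenary this morning #lsa2013",
  "Voting now #LSA2013 #woty12 #ads2013",
  "hashtag wins! #lsa2013 #woty12"
)

# Pull out every hashtag, lower-cased so #LSA2013 and #lsa2013 are one tag.
tags <- tolower(unlist(regmatches(tweets, gregexpr("#[[:alnum:]_]+", tweets))))

# Set aside the tag that defines the sample.
tags <- tags[tags != "#lsa2013"]

n_uses      <- length(tags)          # total hashtag uses
n_different <- length(unique(tags))  # number of different hashtags
top_tags    <- sort(table(tags), decreasing = TRUE)
```

`top_tags` then lists the tags from most to least used, which is the ranking the word cloud shows visually.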

Our colleagues in the MLA were a bit more reporterly and critiquey. That is, qualitatively, I think they had a lot of really interesting conversations and observations. The LSA tweets were more independent of each other.

The MLA folks are also a lot more garrulous–their conference was longer, but they’d still win even if you restricted the comparison to a single day of data. Here, I’m reporting data from the first 1,580 tweets per day from Jan 3 to today (why this restriction? Check out the mini-tutorial below). There were 5,664 #mla13 tweets–3,774 if we remove retweets marked with “RT”. Note that this means both conferences were similarly retweety: about 33% of tweets at each conference involved retweeting.

Notice that the MLA folks had a strong convention to mark the session they were in (that’s the big #s112, etc). They had 17 sessions hashtagged more than 50 times. Overall, they had 472 different hashtags, used a total of 4,870 times.

In corpus linguistics it’s useful to distinguish “types” (like individual words) from “tokens” (uses of those types). The ratio of hashtag TYPES to tweets is similar for LSA and MLA people, but the MLA folks are using theirs a lot more. (Again, I think this has to do with the fact that the MLA folks were consistently labeling their sessions and doing more conversational stuff than the linguists.)
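To make the type/token comparison concrete, here is the arithmetic on the counts reported in this post. One caveat: the LSA hashtag counts exclude #lsa2013 itself, and I am assuming (the post does not say) that the MLA counts treat #mla13 the same way.

```r
# Counts as reported in the post.
counts <- data.frame(
  conference     = c("LSA", "MLA"),
  tweets         = c(872, 5664),
  hashtag_types  = c(75, 472),
  hashtag_tokens = c(305, 4870)
)

# Different hashtags per tweet, hashtag uses per tweet, and uses per hashtag.
counts$types_per_tweet  <- counts$hashtag_types  / counts$tweets
counts$tokens_per_tweet <- counts$hashtag_tokens / counts$tweets
counts$tokens_per_type  <- counts$hashtag_tokens / counts$hashtag_types

print(counts, digits = 3)
```

Types per tweet come out at roughly 0.09 for LSA and 0.08 for MLA, while tokens per tweet are roughly 0.35 versus 0.86: similar inventories relative to volume, but much heavier use on the MLA side.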

The biggest other hashtags for the MLA folks were #altac (the alternative academic movement), #nerduendos (sexual innuendos by nerds) and #elit (electronic literature, like these love letters). On the LSA side, I do wish that @dsbigham‘s “Boston is Burning” hashtag had caught on: #sweatervestrealness.

Btw, even though badges to one conference got people into the other, there really were only 14 tweets that had both tags. Maybe there were people going back and forth, but they weren’t tweeting about it. #MissedOpportunities

Finally, a little bit about the “who”. 153 different people used the #lsa2013 tag. The most prolific was @sociolx, who in real life is David Bowie, but the linguist-who-lives-in-Alaska-and-presented-about-how-young-Alaskans-have-vowels-like-coastal-Californians, though obviously he is often forced to live in the shadow of the-happy-66th-birthday-singer-actor-Goblin-king. Bowie had over 3 times as many tweets as the next person.

Here are all the linguists with 20+ tweets (go follow them if you haven’t already).

Over in the MLA, there were 1,024 different people using the #mla13 tag. Here are the ones that had 50+; go follow them, too.
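The per-person counts are a simple tabulation. This sketch uses a made-up data frame; the column name `screenName` is what I recall twitteR's data frames using for the author handle, so treat it as an assumption and check your own column names.

```r
# Hypothetical data: one row per tweet, with the author's handle.
tweets <- data.frame(
  screenName = c("userA", "userB", "userA", "userC", "userA", "userB"),
  stringsAsFactors = FALSE
)

# Tweets per person, most prolific first.
per_user <- sort(table(tweets$screenName), decreasing = TRUE)

n_users   <- length(per_user)  # how many different people tweeted
threshold <- 2                 # e.g. 20 for the LSA list, 50 for the MLA list
prolific  <- names(per_user)[as.vector(per_user) >= threshold]
```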

How to do this yourself

Initially, I just grabbed the LSA tweets by going to Twitter’s web search, searching for “All” (not Top) and then copying and pasting. That’s a kind of dumb way to do things but I didn’t mind. But that doesn’t work for all the bazillions of MLA tweets. So I used the twitteR package for R.

Open R
Install packages that you need (twitteR, ROAuth, digest, bitops, RCurl, rjson)
Load the packages mentioned above, e.g., “library(twitteR)”
Now, you can use OAuth to get some fancier capabilities. That’s going to involve going to Twitter and registering as a Developer. I ran into a dumb certificate problem, so I ended up not doing that. That’s why for the MLA I don’t have “all” the data. But since this is just meant to be a quick project, I decided to let it be.
Now we start the actual searching through the Twitter API.

mlaJan2to3<-searchTwitter("#mla13", n=1580, since="2013-01-02", until="2013-01-03")

Do this for each date (the LSA data is small enough that you don’t have to specify an “until” date and you can still get everything).
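Rather than typing one call per date, you could generate the since/until pairs and loop over them. This is only a sketch: the date range is my assumption, `fetch_day` is a hypothetical wrapper, and the actual search is left commented out because it needs the twitteR package and a network connection.

```r
# One-day windows; the start and end dates here are assumptions.
days <- seq(as.Date("2013-01-03"), as.Date("2013-01-08"), by = "day")
windows <- data.frame(
  since = format(head(days, -1)),
  until = format(tail(days, -1)),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around searchTwitter(); defined here but not called.
fetch_day <- function(since, until, tag = "#mla13", n = 1580) {
  twitteR::searchTwitter(tag, n = n, since = since, until = until)
}

# To run the searches for real (one list of tweets per window):
# results <- Map(fetch_day, windows$since, windows$until)
```

Each row of `windows` corresponds to one search, so the result is one list of tweets per day.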

Now you want to turn this into a “data frame” so it’s easier to deal with.

mla1<-twListToDF(mlaJan3to4)

And do this for each one.

Now combine them

mla<-rbind(mla1, mla2, mla3, mla4, mla5)
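If the per-day data frames live in a list rather than in separately named variables, `do.call(rbind, ...)` does the same job without listing them by hand. The data frames below are tiny stand-ins, not real output from twListToDF.

```r
# Hypothetical stand-ins for the per-day data frames (mla1, mla2, ...).
daily <- list(
  data.frame(text = c("a #mla13", "b #mla13"),
             screenName = c("u1", "u2"),
             stringsAsFactors = FALSE),
  data.frame(text = "c #mla13",
             screenName = "u3",
             stringsAsFactors = FALSE)
)

# Stack all days into one data frame; the columns must match across days.
mla_all <- do.call(rbind, daily)
```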

And write it as a table. If you like Excel, it’ll be easiest if you use some sort of separator like “|”, which I do here. Regardless, there will be some clean-up you’re going to have to do.

write.table(mla, file="mla13hashtagtweets.txt",sep="|")
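Some of the clean-up can be avoided when writing and reading the file. This is a sketch on made-up data, and the extra arguments are my suggestions rather than what I originally ran: doubling embedded quotes on the way out, and on the way back in turning off comment handling (so "#" in hashtags is not treated as a comment) and treating only double quotes as quote characters.

```r
# Hypothetical tweets, including one with a "|" inside the text.
tweets <- data.frame(
  screenName = c("userA", "userB"),
  text = c("Session was great #mla13 #s112", "pipes | inside a tweet #mla13"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".txt")

# Write with "|" as the separator; text fields are quoted by default.
write.table(tweets, file = path, sep = "|", row.names = FALSE, qmethod = "double")

# Read it back without letting "#" or apostrophes break the fields.
back <- read.table(path, sep = "|", header = TRUE, quote = "\"",
                   comment.char = "", stringsAsFactors = FALSE)
```

Because the text column is quoted, the "|" inside the second tweet survives the round trip instead of splitting the row.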

(Personally, I’d create one file for all the MLA stuff and one file for all the LSA stuff.)


Aww, hmmm, ohh heyyy nooo omggg!

8 Jan

Tonight, Stacy Dickerman (@linguajinks) noticed a friend on Twitter use “DUMBBB”. A succession of observations followed:

“Just saw a friend tweet “dumb” as “DUMBBB” which now has me thinking about stylistic repetition of letters on Twitter, esp. letter choice.”
“In this case, word boundary seems to influence letter choice more than stressed phono-orthographic element. (cf. “DUMMMB” or “DUUUMB”)”
“Also interesting: “DUUUMB” looks really weird to me even though the vowel is probs the sound I’m most likely to lengthen in actual speech.”
“Three “U”s look totally fine in “DUUUMMMMMMBBBBBBB” however (U=3, M=6, B=7). Intriguing.”

One of the things I looked at in my dissertation was this kind of “expressive lengthening” (I sometimes call it “affective lengthening” but I’m pretty sure “expressive lengthening” is better).

The most straightforward thing to track is letters repeated three or more times (because we have sooo many words like groove and flipped and erroneous that have legitimate double letters).

So let me first focus on three-letters-or-more. Here is how many different words in my data show each kind of repetition:

Repeated letter	Count of different words
ooo+	89
eee+	52
mmm+	16
sss+	11
rrr+	6
ppp+	6
nnn+	5
ttt+	4
ggg+	4
fff+	2
ddd+	2
(Total)	197
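The three-or-more criterion is easy to express as a regular expression with a backreference. This is a minimal sketch on a handful of example words, not the code behind the counts in the table.

```r
words <- c("sooo", "groove", "flipped", "erroneous", "hmmm", "omggg", "yeahh")

# A letter followed by at least two more copies of itself.
pattern <- "([a-z])\\1{2,}"

has_triple <- grepl(pattern, tolower(words), perl = TRUE)
lengthened <- words[has_triple]

# The repeated run in each lengthened word, and the letter it consists of.
runs    <- regmatches(lengthened, regexpr(pattern, lengthened, perl = TRUE))
letters_repeated <- substr(runs, 1, 1)
```

Legitimate doubles like the "oo" in groove or the "pp" in flipped do not match, which is the whole point of starting at three.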

The most popular words are:

awww
aaa
hmmm
mmm
soo
ahhh
xxx (if you want to count this)
ohhh
oooh
soooo

As you can see, the only one of these that is derived from a “normal” word is sooo.

My guess is that expressive lengthening starts with expressive sound-words like aww and then moves over to shorter words of affirmation/negation/politeness: nooo, heyyy, yeahh, yesss, meee are all frequent.

Then there’s a further extension that I personally love. lmaooo, loool, lolll, and omggg are also popular. This is awesome because speech and writing get disconnected even though expressive lengthening probably began as a way of indicating spoken length (“I want you to hear what I’ve written”). Some people do say lol aloud, but no one (to my knowledge) uses lmao or omg in speech without spelling them.

That omggg may have caught your attention because it sort of looks like someone is trying to lengthen a hard-g. You can’t lengthen a hard-g. Try it. You can repeat/stutter sounds like /g/, /t/, /b/, but you can’t hold them like a vowel or a nasal (/m/, /n/) or a sibilant (/s/, /z/).

But you get people doing this. My favorites are uppp, happpy, yuppp, nighttt, shittt, anddd, amazinggg.
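One rough way to pull out these cases is to check whether the lengthened letter is one that usually spells a stop. This is a sketch under a big assumption: letters are only a proxy for sounds (the g in amazinggg, for instance, is not really a hard-g), and `stop_letters` is my own shortlist.

```r
# Letters that typically spell stop consonants (an orthographic proxy only).
stop_letters <- c("p", "b", "t", "d", "k", "g")

words <- c("uppp", "yuppp", "nighttt", "anddd", "amazinggg",
           "hmmm", "yesss", "nooo")

# Return the first letter that appears three or more times in a row.
repeated_letter <- function(w) {
  pos <- as.integer(regexpr("([a-z])\\1{2,}", w, perl = TRUE))
  ifelse(pos > 0, substr(w, pos, pos), NA_character_)
}

lengthened <- repeated_letter(words)
is_stop    <- lengthened %in% stop_letters
unholdable <- words[is_stop]  # lengthened in writing, not holdable in speech
```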

I’ll close up now, but first let me put the triples in the context of doubles-we-know-are-doubles (so I’m not reporting ff or oo).

Repeated letter	Count of different words
ooo+	89
hh+	75
aa+	59
eee+	52
yy+	42
ww+	34
lll+	16
mmm+	16
xx+	13
uu+	13
ii+	12
sss+	11
kk+	10
ppp+	6
rrr+	6
nnn+	5
ttt+	4
ggg+	4
aa+/hh+	3
fff+	2
ddd+	2
(Total)	474

PS–Forms like “DUUUMMMMMMBBBBBBB” are pretty uncommon. The main words that have multiple different letters repeated are aawww, ooohh, awwhh, aahh, ooohhh, aaww, hahhaa. There are probably more like wooohooo but I haven’t annotated those distinctively.
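Finding words with more than one repeated letter can be sketched with run-length encoding. `letter_runs` and `has_multiple` are hypothetical helper names, and counting any run of two or more as "repeated" is my assumption, so legitimate doubles would need to be filtered separately.

```r
# Lengths of every run of 2+ identical letters in a word, named by letter.
letter_runs <- function(word) {
  chars <- strsplit(tolower(word), "")[[1]]
  r <- rle(chars)
  keep <- r$lengths >= 2
  setNames(r$lengths[keep], r$values[keep])
}

# TRUE if at least two different letters are repeated in the word.
has_multiple <- function(word) {
  length(unique(names(letter_runs(word)))) >= 2
}

examples <- c("aawww", "hahhaa", "hmmm", "wooohooo")
multi <- vapply(examples, has_multiple, logical(1), USE.NAMES = FALSE)
```

Note that wooohooo comes out as FALSE here because both of its runs are the same letter, so catching that pattern would need a count of runs rather than a count of different letters.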

