Archive | July, 2013

16 places that aren’t anywhere

26 Jul


Idibon’s focus is on language technologies, but we also have pretty good chops when it comes to spatial data—check out our post on hosting FEMA’s aerial damage assessments following Hurricane Sandy. Geotagging gives us a way to understand language in terms of latitude/longitude. That’s often a way to make text analysis even more insightful and actionable.

Recently, while we were stitching together textual and geographic information, we found 16 places that (almost) slip through the cracks.

The Natural Earth database is a great resource for geolocation, giving the outlines of ~250 countries and territories that allow people to easily map coordinates to countries. In addition to a fairly standard English-language name, another piece of information that comes back is a country ISO code, which acts as a unique identifier so we can then pull in information from other data sources.

In particular, we use the ISO code to get information from GeoNames, which has a ton of details about places in the world (like giving alternative names for a place in lots of different languages). The vast majority of places in the Natural Earth data have ISO codes. But 16 of them don’t. We can still find these places in GeoNames; we just can’t look them up by direct reference. A tour of these mismatches takes us around the world—mostly to controversial places. You’ll be well aware of some of them but others you’ve probably never heard of.
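Here is a minimal sketch of that kind of lookup, using made-up records. The field names (`name`, `iso_a2`), the `-99` placeholder for a missing code, and the sample rows are all assumptions for illustration; they are not the actual Natural Earth or GeoNames schemas. The idea is simply: join on the ISO code when there is one, and fall back to a name match when there isn't.

```python
# Toy records standing in for Natural Earth features (hypothetical field names).
MISSING = "-99"  # assumed placeholder meaning "no ISO code"

natural_earth = [
    {"name": "France", "iso_a2": "FR"},
    {"name": "Kosovo", "iso_a2": MISSING},
    {"name": "Somaliland", "iso_a2": MISSING},
]

# Toy GeoNames-style lookups: one keyed by ISO code, one by lowercased name.
geonames_by_iso = {
    "FR": {"name": "France", "alt_names": ["Frankreich", "Francia"]},
}
geonames_by_name = {
    "kosovo": {"name": "Kosovo", "alt_names": ["Kosova"]},
    "somaliland": {"name": "Somaliland", "alt_names": []},
}


def link(feature):
    """Return (match, how): direct ISO join first, then a name-based fallback."""
    code = feature["iso_a2"]
    if code != MISSING and code in geonames_by_iso:
        return geonames_by_iso[code], "iso"
    match = geonames_by_name.get(feature["name"].lower())
    return match, ("name" if match is not None else "unmatched")


# The places that slip through the direct join.
no_iso = [f["name"] for f in natural_earth if f["iso_a2"] == MISSING]
```

In this toy data, `no_iso` holds the two places without a code, and `link` reports whether each match came from the code or from the name, which makes it easy to review the fallback matches by hand.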

1: Siachen Glacier

The Karakoram range of the Himalayas has the highest density of tall peaks in the world (it doesn’t have Mount Everest but it has the number two peak in the world, K2). You’ll find references to it in Rudyard Kipling’s Kim and it’s a big part of Greg Mortenson’s Three Cups of Tea.

In addition to being the boundary between two colliding continents, the Karakoram has been the boundary of two colliding nuclear powers: India and Pakistan. Skirmishes are major news events in the area, but more soldiers have died from weather conditions than combat.

The Siachen Glacier is 43 miles (70 km) long, which is really big. But India, which controls the area, can expect to have less and less of it. By 2035, it’s estimated that it will be only 1/5 of its current size. (Go watch Chasing Ice.)


2 and 3: Serranilla Bank and Bajo Nuevo Bank

If you look in Wikipedia, you might wonder why Banco Serranilla is so disputed given that it is mostly underwater. Serranilla Bank has been a point of conflict between Colombia, Honduras, Nicaragua, and the US; Bajo Nuevo Bank has been disputed by Colombia, Jamaica, Nicaragua, and the US. It looks like most of these claims are pretty dormant, except for the Nicaragua vs. Colombia claims. Colombia has been occupying both of them and last November the International Court of Justice said, yep, Colombia had sovereignty over the areas.


4 and 5: Scarborough Reef and the Spratly Islands

There’s a lot of stuff going on in the South China Sea. The Scarborough Reef/Shoal (Huangyan Island in Chinese) involves a dispute between China, Taiwan, and the Philippines. The People’s Republic effectively has control over the area, though there were some military conflicts last year with the Philippines and it’s still an actively contested area. The shoal is about 58 square miles (150 sq km). 

The Spratly Islands cover a much bigger area: about 164,100 square miles of sea (425,000 sq km), though the land in this area is only about 1.9 square miles total (4.9 sq km). They are also more disputed: China, Taiwan, the Philippines but also Malaysia and Brunei.


6: Baikonur

Baikonur is the site of the Baikonur Cosmodrome (which is obviously one of the best words possible). That’s the world’s first and biggest space launch facility: Sputnik 1 and Vostok 1 both took off from here.

Baikonur is not so much a disputed city as it is a “rented” one. It’s situated in Kazakhstan but Russia administers it.

Expedition 36 Launch

7: Coral Sea Islands

These are mostly uninhabited islands and reefs northeast of Queensland, Australia. But they are involved in what is now my favorite dispute in the world because the national anthem for the underdog in the dispute is Gloria Gaynor’s version of I Am What I Am. In January of 2004, the Gay & Lesbian Kingdom of the Coral Sea Islands claimed the territory. Their first stamps were issued in 2006, “with the aim of creating a high and distinctive reputation amongst the philatelic fraternity”. Obviously, that is worth repeating: “philatelic fraternity”.


8 and 9: Northern Cyprus and the Cyprus UN Buffer Zone

Cyprus gained independence from British rule in 1960, with a constitution meant to treat Greek Cypriots and Turkish Cypriots fairly. These protections started being threatened relatively soon thereafter. Then there was a coup in 1974, a possible annexation to Greece, and a Turkish invasion and…can we go back to the Coral Sea Islands? Turkey is really the only nation that recognizes Northern Cyprus as a nation of its own. It is separated from the rest of Cyprus by the UN Buffer Zone, which is about 134 square miles (346 sq km). There are about 1.1 million people on the whole island: about 300,000 are in Northern Cyprus and about 10,000 live in the buffer zone.


10 and 11: Dhekelia and Akrotiri

Wait! We’re not done with Cyprus. When the British Empire said Cyprus could be independent in 1960 it said “Well, except we want to keep about 3% of it.” Mainly because military bases there are a great asset (you’re close to the Suez Canal).


12: Clipperton Island

The French control this coral atoll (2.3 sq mi/6 sq km). Or rather coconut palms control it but the Minister of Overseas France lists it on his LinkedIn profile. Good line from Wikipedia: “It has had no permanent inhabitants since 1945. It is visited on occasion by fishermen, French Navy patrols, scientific researchers, film crews, and shipwreck survivors.”


13: Somaliland

What we see as “Somalia” on most of today’s maps used to be two parts, one ruled by the British and one by the Italians (and actually even the Italian part, which includes Mogadishu, was eventually under British rule). “British Somaliland” was a protectorate up til 1960 (the same year Cyprus got its independence). The two parts of Somalia became independent at separate times, but joined together by the end of 1960. There was a coup in 1969 and the military took control and took the country in a communist direction (“Major General Mohamed Siad Barre, Chairman of the Supreme Revolutionary Council”).

In 1991, Barre was overthrown and the northern part of Somalia, the part that was British Somaliland, declared independence. You are probably aware of all the chaos of Somalia; most of that has been in the south. The story of the violence there is long and difficult, but Somaliland has been relatively stable and functional. No one recognizes it as a nation, however.


14: Guantanamo Bay

You are almost certainly aware that the US Navy operates a base in Cuba known as Guantanamo Bay. It’s the largest harbor on the south side of Cuba and its steep hills keep it cut off from the rest of Cuba. The Cuban-American Treaty of 1903 gave the US a lease, but Cuba considers that treaty invalid (obtained by threats of force) and has been protesting it since 1959. The US sends Cuba a check for renting the space every year for $4,085. Only one of these has ever been cashed (back in 1959—Castro says it was by mistake since it was still during the early part of the Cuban Revolution).


15: Kosovo

You are also probably aware of Kosovo, an area long associated with Serbia that has an Albanian majority. Kosovo declared its independence in 2008. There are 101 countries that recognize it as such, though Serbia does not. (Northern Cyprus does but the Republic of Cyprus does not.)


16: Indian Ocean Territory

The British claim over 21,000 square miles (54,400 sq km) of ocean, made up of about 23 sq miles of land (60 sq km). The largest island is Diego Garcia (17 sq mi/44 sq km), which the US and the UK operate as a joint military facility. The native population of Chagossians was forcefully evicted in the 1960s (to Mauritius and the Seychelles). They are making some strides in winning court battles (meanwhile there are British environmental plans that have pleasant environmental rationales but are at least partly about keeping the Chagossians at bay).

According to economist Peter Hammond, the fact that the British still control both the British Indian Ocean Territory and the Pitcairn Islands (in the Pacific) is the reason that the sun still does not, technically, set on the British Empire.

A Sailor relaxes on a sailboat in Diego Garcia.

Pitfalls of place names

Another major natural language processing task is turning places-that-come-in-the-form-of-words into places-in-the-form-of-coordinates. Consider coming across Montmartre. If I give you no other context, you’d want to guess that I’m talking about something in the north of Paris. But am I talking about the area or the hill that the area is named after? The Moulin Rouge is in Montmartre but it isn’t on Montmartre. And if the occurrence were actually part of Cafe Montmartre then it could be a number of places. Yelp tells me there’s a Cafe Montmartre in Reston, Virginia. It’s also known in literary circles as a cafe in Prague where Franz Kafka hung out. (Fair is fair: Harry’s New York Bar is most famously a bar in Paris.)

These are the kinds of reasons why you want at least a little bit more natural language processing intelligence in your Named Entity Recognition system rather than just using keywords. This also helps you know if someone talking about US is discussing the United States of America or just really yelling the first person plural.

One of the ways maps are useful is that they relate places together: New York is in New York is in the US is in North America is on planet Earth. Here again you have a little bit of trouble: there are lots of alternative words and phrases (“Big Apple”, “NY”). And what do you do with places that don’t have obvious latitude/longitude coordinates (“Atlantis”) or which have really high altitude coordinates (“Heaven”)? What do you do about places that don’t exist anymore? (“USSR”, “Free Independent Republic of West Florida”)? If you’ve a date in Constantinople she’ll be waiting in Istanbul.
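To make the containment and alias problems concrete, here is a small sketch of a toy gazetteer. Everything in it is hypothetical: the place ids, the alias table, the rounded coordinates, and the simplification of hanging Istanbul directly off "earth". It shows two things from the paragraph above: one surface string can point at several candidate places, and each place sits in a chain of containers.

```python
# Hypothetical mini-gazetteer: place id -> parent id and (lat, lon) if known.
PLACES = {
    "earth": {"parent": None, "coords": None},
    "north_america": {"parent": "earth", "coords": None},
    "us": {"parent": "north_america", "coords": None},
    "ny_state": {"parent": "us", "coords": None},
    "nyc": {"parent": "ny_state", "coords": (40.71, -74.01)},
    "istanbul": {"parent": "earth", "coords": (41.01, 28.98)},  # simplified
}

# Surface forms map to one or more candidate ids; ambiguity is kept, not hidden.
ALIASES = {
    "new york": ["nyc", "ny_state"],
    "ny": ["nyc", "ny_state"],
    "big apple": ["nyc"],
    "istanbul": ["istanbul"],
    "constantinople": ["istanbul"],  # historical name, same coordinates
}


def candidates(mention):
    """Return every place id a mention could refer to (empty if unknown)."""
    return ALIASES.get(mention.strip().lower(), [])


def ancestors(place_id):
    """Walk up the containment chain: city -> state -> country -> ..."""
    chain = []
    parent = PLACES[place_id]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = PLACES[parent]["parent"]
    return chain
```

With this table, `candidates("New York")` returns both the city and the state, and picking between them is exactly where surrounding context has to come in. A mention like "Atlantis" simply returns no candidates, which is a more honest answer than a guessed coordinate.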

– Tyler Schnoebelen (@TSchnoebelen)

ps–Special thanks to Mark Johnston for finding the mismatch that made this post possible and for his general geo-wizardry!


Affect and power

14 Jul

This is an ambitious post in its scope but rather loose in its structure. I want to connect affect/emotion, power, police, Stand Your Ground laws, sentiment analysis, marketing, digital humanities, and Sharknado. Underneath, I’m thinking about how expressions of emotions that we may exhibit or witness are not so much internal states made visible as they are kinds of social positionings. (You can check out my dissertation, “Emotions are relational: Positioning and the use of affective linguistic resources” if you want to see the assumptions/data/findings behind a lot of this; it’s also much more corpus linguistics than you’ll get out of this particular post.)

Police states

I want to begin by asking you to go off to The Moth to listen to Steve Osbourne, an NYPD cop, talk about doing his job. But I’ll try to make this work even if you don’t.


A huge part of Osbourne’s story is answering the question, “How do you deal with all the feelings and still do what you do?” How do you see young men dead in the street and keep going? It’s something he addresses specifically and returns to throughout his narrative. Here’s how he conceptualizes it, breaking out of the narration of events on a Tuesday morning at home with his wife. That is, Osbourne finds it necessary to interrupt the order of events to make sense of them via a lengthy metacomment. This is an important way that emotions come to us in texts—not just in the reporting of events but in the present-tense evaluation of them.

And the way we do it is, you learn, very early how to shut it down. You learn how to turn off your feelings and you learn how to be professional. You learn how to do your job….Everybody thinks that, like, we build a wall between us and the public, that’s not necessarily true, what we learn to do is build a wall between ourselves and our feelings. And that’s how you stay focused, that’s how you stay professional, and that’s how you do your job. [6:03-6:42]

The idea of shutting things down, turning them off is crucial for Osbourne. So is the idea of doing his job and being professional—that’s why and how he leaves his wife on an emotional morning. The truth is that his account of the takedown day is suffused with emotion and it would be useful to map out which ones are compatible with professionalism and which are not. Although he lumps all feelings together in his explicit words, he’s clearly not actually lumping them together over the course of the narration. But the way the narrative works is to explain how it’s required to become like stone and how cracks appear. Eyes-welling-with-tears count as cracks, but not every emotion that is implied seems to count as a crack. This suggests the importance of thinking through emotional management/regulation. Situations, identities, choices, institutions, and selves are constructed/maintained/perturbed by which types of emotions are acceptable, when, and for how long. Emotional regulation is also clearly structuring why/how/when evaluations pop up in narratives.

For Osbourne, doing his job requires shutting down feelings. He has an idea of “cop mode” being something not-human. (Implicit: feelings are human, more explicit: feelings get in the way of institutional/exceptional/quotidian duties.)

I knew that eating, like, the physical act of eating, like, putting food in your mouth, makes you feel like a human being. And I had no time for that. I had no time to be a husband, I had no time to be a human being, I was in that cop mode, you know [8:28-8:42]

This notion also reappears and, in fact, the story ends with consuming free food from a McDonald’s tent and trying to hide tears from the other cops when he sees a 5 year-old’s thank you note in his happy meal. In the story, Osbourne refuses his wife’s sandwich but accepts the McDonald’s burger. He loves free food. He feels that the McDonald’s tent shows that someone cares. Someone has thought: rescue workers need to eat. You could really go crazy with connections between consumption, the body, emotion, home, relationships, law/order, and the costs of citizenship.

Note also that we are talking about a story that is recorded and propagated through “The Moth”, which is meant to promote story-telling. Featured stories are unlikely to be randomly chosen—they likely involve that magic combination of telling people something they already know alongside some unexpected turns. Having said that, I’m not sure any of the turns in Osbourne’s story are all that unexpected and my guess is that his accent is part of what makes him compelling to both listeners and selectors-of-featured-stories-within-The-Moth. In other words, there’s a story of circulation, citationality, style and emotion here to tell, too. And I probably shouldn’t leave out class. Class, occupation, accent. Gender is part of the story and sexuality, too. We drift into how intersectional all these categories are.

Access to affect is not randomly distributed. Everyone moves around between various social positions (we move ourselves, we are moved by others) but these social positions have different affective foot patrols structuring who feels what when. And we see this start-to-stop through the George Zimmerman trial (easy to slip and call it the “Trayvon Martin trial”, isn’t it?). In taking an affective focus we may turn away from the role of institutions. But these institutions structure and are structured by affects. Who is allowed to feel what when. Who gets to be afraid, who doesn’t? What emotions are “reasonable”? Stand Your Ground laws are built upon and build out differences in permitted emotions.


Read http://ow.ly/1ZfaNa for more; this is from the Urban Institute Justice Policy Center’s research on murders being found justifiable (SYG indicates states with Stand Your Ground laws)

I am reminded of John Gaventa’s work on power. It is surely an exercise of power to force a group of coal miners to stop striking. But there is something even more striking about the kind of power that prevents miners from even considering a work stoppage. Individuals and structures create marginal people and this is surely connected to the creation of marginal feelings. Feminist scholars considering definitions of power have wondered about adding “power to do something” to the standard definition of “power over something/someone”. In an affectively oriented research program, we might see power as the ability to block and transform feelings—in which case, the need to enter into cop mode shows power being exercised through Steve Osbourne. The feelings that a young man being hunted would feel would also be evidence of power. What feelings get marginalized in your own recountings? What feelings do you never even feel in the first place? (Here’s my longer essay on power, fwiw, dense with references.)

As I close out this section, consider the difference in our understanding of George Zimmerman’s various affective states and the ones that Steve Osbourne relates. For Osbourne, the personal need to shut down feelings is connected to a job, which is (I am presuming) connected to ideas of maintaining order and justice. This is not how Zimmerman seems to conceptualize his experience nor is it how we “read” him.

In a corpus linguistics blog, I should properly show to what degree Osbourne’s (or Zimmerman’s) narratives, metaphors, etc. are shared by others. I’m not going to do that, but here are a couple of observations in that direction.

  • The evocative shut it down also comes up in a number of media interviews of people in law enforcement in COCA. There it can mean stopping an investigation (an internal matter) or stopping criminal activity (an external matter). It’d be interesting to look across oral histories or other data sources and build connections between these kinds of uses and the emotional kind. In Osbourne’s story, shutting his emotions down is part of his being able to do his work. But emotional regulation seems to also be connected to audiences like his wife and the people he works with.
  • Paul Ekman’s work on emotion has focused on the face; it seems that when people talk about being “like stone”, they are often talking about the face. The edifice metaphor is about nothing going on behind the hardness—or at least no signal except for “impenetrable”. Is there a propensity for this to be about what others see or about what the individual feels? Or is this even a meaningful division?

Marketing and sentiment analysis

After I added the Trayvon Martin content, I recognized that I really ought to just cut this section. Surely it is too trivial to go alongside important issues of justice. I am leaving it in because it may suggest how sentiment analysis projects could be more than buzz metrics. They could expand to complicate and clarify our notions of which individuals express which sentiments regarding which topics to which audiences. They could point us to affective social network analysis. Sentiment could be seen as not just a percentage to report but as part of a bigger process that structures the spread of information and sentiments themselves. But this is not a one-way street. Culture studies could likely benefit from the way (sophisticated) sentiment analysis tools model texts, pinpointing which specific elements convey which kinds of signal.

Sharknado trailer (Screengrab)

With that caveat, let’s take a recent tweet about the Syfy made-for-TV-movie, Sharknado, which whipped up a frenzy of activity on Twitter.

That’s friggin’ gross! And AWESOME! #Sharknado

Here’s how a sentiment analysis tool is basically going to deal with this—you decompose this into features. Somewhere there was training data that was coded. A model was built that made someone somewhat happy with the fact that when you held out parts of the training data and treated them as test data, the model didn’t get too many false positives or too many false negatives. That is, it had a reasonable level of accuracy.
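As a rough illustration of the held-out check described above, here is a sketch that compares a model's predictions on held-out items against the human-coded labels. The labels and predictions are invented for the example, and `evaluate` is a hypothetical helper rather than any particular tool's API. A false positive here means the model said "positive" when the coder did not; a false negative is the reverse.

```python
def evaluate(gold, predicted, target="positive"):
    """Score predictions for one target label against human-coded gold labels."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == target and p == target)
    false_pos = sum(1 for g, p in pairs if g != target and p == target)
    false_neg = sum(1 for g, p in pairs if g == target and p != target)
    correct = sum(1 for g, p in pairs if g == p)
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        # Of the items the model called positive, how many really were?
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the items that really were positive, how many did the model find?
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Invented held-out data: four coded tweets and what a model guessed for each.
gold = ["positive", "negative", "positive", "neutral"]
predicted = ["positive", "positive", "negative", "neutral"]
report = evaluate(gold, predicted)
```

On this tiny invented sample the model makes one false positive and one false negative, so precision, recall, and accuracy all come out at 0.5. Real evaluations need far more held-out data than this, but the bookkeeping is the same.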

A few notes:

  • Usually the features come down to individual words (the exclamation point will likely count as a word). More academic researchers will likely have bigrams, trigrams, maybe 4grams. That would be atypical in the kind of tools that a marketing person at Syfy would have access to.
  • A sophisticated system would come up with features from the data itself. But most commercial tools just involve keywords that were decided a priori to carry a signal.
  • Usually it’s just “positive, negative, neutral”, even though clearly our emotional lives are awash in more complicated states. For example teasing (potentially-positive-solidarity-through-negative) and ambivalence (both love and hate).
  • There’s clearly a signal in the all-caps. Probably most systems don’t use this.
  • Ideally, the training data is like the data you want to predict sentiment for. How useful is it to predict the sentiment of movie reviews based on book reviews or Twitter data based on Amazon data?
  • Genre plays a role beyond data source: gross in a romantic comedy is likely to be negative, while in a horror movie it’s likely to be positive.
  • How do you know what the sentiment is about? Presumably it’s about some particular scene (like the hero chainsawing his way out of a great white shark). More generally, the hashtag #Sharknado is a pretty good clue.
  • Notice that the words chosen exist in a web of alternatives. The most obvious of these is friggin’ which seems to be avoiding a swear word. This choice would carry a social signal in the real world that is not disconnected from affect, though no tools I know about really know how to unite this kind of positioning with scoring sentiment.
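A few of the notes above (word features, bigrams, the exclamation point as a "word", the all-caps signal) can be sketched in a few lines. This is an assumed, simplified feature extractor, not how any specific commercial tool works; the tokenizer pattern and the feature names (`uni=`, `bi=`, `n_allcaps`) are made up for the example.

```python
import re

# Words (optionally a hashtag, optionally with an apostrophe) or runs of ! and ?
TOKEN = re.compile(r"#?\w+(?:['’]\w*)?|[!?]+")


def features(text):
    """Decompose a tweet into unigram counts, bigram counts, and an all-caps count."""
    tokens = TOKEN.findall(text)
    lowered = [t.lower() for t in tokens]
    feats = {}
    for tok in lowered:
        key = "uni=" + tok
        feats[key] = feats.get(key, 0) + 1
    for first, second in zip(lowered, lowered[1:]):
        key = "bi=" + first + "_" + second
        feats[key] = feats.get(key, 0) + 1
    # Count shouted words before lowercasing throws that signal away.
    feats["n_allcaps"] = sum(1 for t in tokens if len(t) > 1 and t.isupper())
    return feats


tweet_features = features("That’s friggin’ gross! And AWESOME! #Sharknado")
```

For the Sharknado tweet this yields, among others, a count of two for the exclamation point, one for the hashtag, and one all-caps word. Note what is still missing: nothing here knows that gross flips polarity by genre, or that friggin’ is a softened alternative to something stronger.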

Commercial sentiment analysis tools tend to give marketing professionals a “score” that they can track over time. Ideally, the tools give the marketing folks a way to understand what’s going wrong, what’s going right, how brand awareness is trending, and some way to actually take action to take advantage of opportunities and nip product problems in the bud. (Here a product could be a blender or a political candidate.)

To accurately capture the sentiment of the Sharknado-nado, though, we can’t really just look at individual tweets. We would need to appreciate that there was a broad group activity going on and that expressed sentiments were not in a vacuum; they were part of a flurry and the delight of being caught up in that (see also Bachorowski, Smoski, Tomarken, & Owren, 2004 on people laughing at movies more with others than alone). Participation in the #Sharknado event involved the ability to tweet, retweet, get retweeted. The ability to celebrate something terrible or take an oppositional stance. There was identity work and relationship work going on. The construction of sentiment is bigger than just “you have four positive words, two negative words, and five neutral words; based on the following weightings you feel positively”. Current systems do not know what individual tweeps’ emotional regulation schemes are (if everything I tweet about is AWESOME, is it really all that awesome? Maybe, but it’s not as obvious as if I have a broader distribution of evaluations). Modeling and computing an enriched sense of context is not easy. But there is an implicit assumption that the data is disconnected from its author, its audience, and other patterns. It is not.
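One small piece of that richer context can be sketched: scoring a tweet relative to its author's usual level rather than in isolation. The word lists, the example histories, and the idea of subtracting the author's mean are all assumptions for illustration, a toy stand-in for the "emotional regulation scheme" point rather than an established method.

```python
# Tiny hypothetical keyword lists of the a-priori kind criticized above.
POSITIVE = {"awesome", "love", "great"}
NEGATIVE = {"gross", "terrible", "hate"}


def keyword_score(tokens):
    """Count positive keywords minus negative keywords."""
    words = [t.lower() for t in tokens]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def relative_score(score, author_history):
    """Shift a score by the author's own average, so habitual enthusiasm counts for less."""
    if not author_history:
        return float(score)
    return score - sum(author_history) / len(author_history)


mixed = keyword_score(["friggin", "gross", "and", "AWESOME"])

# The same +1 tweet from two invented authors with different habits.
always_awesome = relative_score(1, [1, 1, 1, 1])
usually_measured = relative_score(1, [-1, 0, 0, 1])
```

The Sharknado tweet nets out to zero under plain keyword counting, which is exactly the ambivalence a single score hides. And an identical +1 tweet reads as 0.0 from the author who calls everything awesome but as 1.0 from the author whose evaluations are usually spread out.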

Some concluding thoughts

I believe these are rich sites for investigation and that I have not done them justice in this space. But I would like to say that cultural theory about emotion often seems to engage in very sophisticated ideas that are untethered from actual examples, even in digital humanities discussions. In this way, it’s not unlike the literature in computational linguistics about sentiment analysis: that literature is highly data driven, but probably suffers from being concerned more about precision/recall and feature definition than enquiry into what’s going on for the individuals whose data their findings are derived from. The former engages reproduction and power, the latter can identify broad patterns and exceptions. There is interesting work ahead to combine them together. Alone, each can and does seem to float above the specific data it purports to be about. Combined together, I believe they would lead researchers back to specific processes, specific examples, specific patterns, and specific exceptions. I may be painting with too broad a brush. For the exceptional scholars, my apologies. I can see amazing possibilities right there on the horizon and I want to go to there.

Post-script

I was recently told a story about a young girl whose mother put her on a good softball team. The girl was clearly the worst player on the team. This did not deter the mother who insisted her daughter play. Up to bat, the girl would shed tears with each strike and she had a lot of strikes. Everyone would look away. The story was told to me to point out that this is how we train each other when to express which emotions. But it clearly isn’t as simple as “everyone was reproducing ‘sports require toughness’”.

Some people may turn away in disapproval of the girl. But surely others of us would turn away because we know the family dynamics at play and we do not know what to do. Do you yell at the mother, do you go hug the daughter? The members of the crowd, too, are shaped by emotional regulation schemes that are broader than “what’s appropriate at a softball game”. How do you balance public facework, family sovereignty, and the like? It’s structuration: structures are built out of individual actions but those actions are structured by the actions that came before them.

These emotional regulation schemes are created, maintained, and perturbed by the stories we tell. But “a text” must be broadly construed. A story is recorded in words but during the event itself: wordless tears, wordlessly averted eyes. In the event itself, not the story of it, there are only the words strike on the field and a mother’s c’mon Sarah from the bleachers. Words are not the only speech acts, silences are, too. And those of us who make our living with words may give them a bit too much weight. They are not always the magical incantations of creation and destruction we imagine them to be. And not everyone has such imaginings. Nor do even the wordiest among us have them all the time.

Too much

Already, here below, has met its match.

Yet nothing’s gone, or nothing we recall.

And look, the stars have wound in filigree

The ancient, ageless woman of the world.

She’s seen us. She is not particular—

Everyone gets her injured, musical

“Why do you no longer come to me?”

To which there’s no reply. For here we are.

(James Merrill, The Book of Ephraim)

What have you lumped lately?

11 Jul


How are you carving up the world and gathering it back together? Or more specifically, when’s the last time that you said both x and y—and what were your x and y? Our dividing and lumping lines tell us a lot about how we see a situation.

A construction like both x and y is also handy for computers. Like you, computers are constantly encountering words and phrases they’ve never encountered before. But they are a lot more clunky than you are when it comes to filling in the gaps. It’s not easy to learn which parts of context to use to build up understanding. In terms of specific natural language processing applications, the both x and y construction helps do Named Entity Recognition (NER) for infrequent/new items. And it is a good testing ground for sentiment analysis: how do you best describe text that expresses both positive and negative sentiment?

In the case of both x and y, it is generally the case that x and y are the same type of thing. In other words, if you only know x, odds are that y is—at the very least—the same part of speech. For example, consider Twitter’s most popular both x and y’s for American English tweeps:

  • both good and bad
  • both men and women
  • both Twitter and Facebook
  • both love and hate
  • both on and off
  • both you and I
  • both males and females

Notice the diversity here—we’ve got adjectives, named entities, common nouns, prepositions, pronouns. If you go enter these in Twitter’s search box, you can also see that in general the construction is taking two things that are traditionally seen as separate (if not oppositional) and saying something is true for not just one of these categories but the other one, too.  In the case of both good and bad, most of the examples have to do with acceptance of everything in one’s life and/or the fact that there are a lot of experiences that are ambiguous. This kind of sentiment is common in uses of both love and hate—that is, that there are things that contain extremes and the tweep is ambivalent, which is not to say wishy-washy (etymologically, ambivalent is ‘in two ways’ + ‘strong’). [Twitter search: “both love and hate” | Twitter search: “both good and bad”]
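For anyone who wants to poke at the construction themselves, here is a minimal sketch of pulling single-word both x and y pairs out of text and counting them. The regular expression and the sample texts are assumptions for the example; the actual pipeline behind the Twitter numbers is not shown in this post.

```python
import re
from collections import Counter

# "both", one word-like token, "and", one word-like token (mentions/hashtags allowed).
BOTH_X_AND_Y = re.compile(
    r"\bboth\s+([@#]?[\w'’]+)\s+and\s+([@#]?[\w'’]+)", re.IGNORECASE
)


def both_pairs(texts):
    """Count lowercased (x, y) pairs across a collection of texts."""
    counts = Counter()
    for text in texts:
        for x, y in BOTH_X_AND_Y.findall(text):
            counts[(x.lower(), y.lower())] += 1
    return counts


# Invented sample texts.
sample = [
    "I both love and hate Mondays",
    "Both love AND hate, honestly",
    "I'm on both Twitter and Facebook",
    "both China and the United States",
]
pair_counts = both_pairs(sample)
```

Because x and y are limited to single tokens, the last sample comes out as the pair (china, the), which is only a fragment of the real noun phrase. Handling multi-word x and y properly would need a parser or chunker rather than a regular expression.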

We can confidently predict the part of speech for 2,237 of the both x and y examples. For these, 64.0% of the data has x and y as the same part of speech (provided we treat pronouns like I/me/you as belonging to the same class as @twittername and proper nouns). This number would go up if we were to allow x and/or y to be more than single words (strictly speaking both China and the would be a proper noun and a determiner, but obviously this quadrigram is usually a fragment of both China and the United States, which is really “both [noun phrase] and [noun phrase]”).
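The agreement calculation itself is simple once each word has a tag. The sketch below uses a tiny hand-made lexicon and invented pairs, so the tags, the `NAME` class (which merges pronouns, @mentions, and proper nouns as described above), and the resulting rate are illustrative only; they do not reproduce the 64.0% figure.

```python
# Hypothetical hand-made lexicon; NAME merges pronouns and proper nouns.
LEXICON = {
    "good": "ADJ", "bad": "ADJ",
    "men": "NOUN", "women": "NOUN",
    "on": "PREP", "off": "PREP",
    "twitter": "NAME", "facebook": "NAME", "china": "NAME",
    "you": "NAME", "i": "NAME",
    "the": "DET",
}


def tag(word):
    """Return a coarse tag, or None when we cannot confidently predict one."""
    word = word.lower()
    if word.startswith("@"):
        return "NAME"
    return LEXICON.get(word)


def agreement(pairs):
    """Return (same_tag_count, confidently_tagged_count) for (x, y) pairs."""
    tagged = [(tag(x), tag(y)) for x, y in pairs]
    known = [(a, b) for a, b in tagged if a is not None and b is not None]
    same = sum(1 for a, b in known if a == b)
    return same, len(known)


pairs = [
    ("good", "bad"),
    ("you", "@friend"),
    ("china", "the"),
    ("on", "off"),
    ("blorp", "bad"),  # unknown word: excluded from the denominator
]
same, known = agreement(pairs)
rate = same / known
```

Here four of the five pairs can be tagged, and three of those four agree, for a rate of 0.75. Leaving unknown words out of the denominator mirrors the "examples we can confidently predict" framing above.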


Twitter is famous for its 140-character limit. What happens when you look at both x and y across books? In the Google Books data for works in American English published 1900-2000, these are the top pairs:

  • both men and women
  • both before and after
  • both male and female
  • both positive and negative
  • both public and private
  • both boys and girls
  • both internal and external

There’s a lot of gender stuff going on. Next week, we’ll look at these more seriously, considering how various pairs have changed over the last one hundred years. So think of the last bullet list and the next graph as a teaser.

Note: there is a general principle of “bulky things go at the end”, so if you have a one-syllable word and a two-syllable word, you tend to put the one-syllable word first. More on this and how it affects couple namings, too. (For example, David and Juliet > Juliet and David, Chris and Colin > Colin and Chris, etc.) Let me know if I’ve both under and over teased you.

Google Books Ngram results for select terms, 1900-2000

– Tyler Schnoebelen (@TSchnoebelen)


Sexing up “6-gram”

3 Jul


Most of natural language processing (NLP) is built off of unigrams—that is, single words or word-like things (like an emoticon or a !?!?). Models sometimes give you bigrams and trigrams. When people talk about these as a group, they refer to n-grams. But why use “6-gram” when you can say “sexagram”? I think you’d only say 13-gram if you were “absurdly absolutely irrationally fearful frightened and afraid of other lovely words like triskaidekaphobia” (that’s a triskaidecagram for you, you can spell it with a k or go for the resonance of a middle-c).
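If n-grams are new to you, here is a minimal sketch of how they get built from a list of tokens. Splitting on whitespace is a simplifying assumption (real tokenizers do more), and `ngrams` is just a helper written for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


phrase = (
    "absurdly absolutely irrationally fearful frightened and afraid "
    "of other lovely words like triskaidekaphobia"
)
tokens = phrase.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
thirteen_grams = ngrams(tokens, 13)
```

The phrase has thirteen words, so it contains thirteen unigrams, twelve bigrams, and exactly one 13-gram: the whole thing.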

Note that you can swap out “-gram” for your favorite root. Why call it an 18-wheeler when you can call it an octakaidecacycle?
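The root-swapping game is easy to sketch as a lookup table. The table below only includes the familiar uni/bi/tri prefixes plus the ones used in this post, and `fancy_name` is a made-up helper; anything not in the table falls back to the plain numeric form.

```python
# Prefixes taken from common usage (uni, bi, tri) and from this post.
PREFIXES = {
    1: "uni",
    2: "bi",
    3: "tri",
    6: "sexa",
    13: "triskaideca",
    18: "octakaideca",
}


def fancy_name(n, root="gram"):
    """Name an n-thing with a classical prefix, or fall back to 'n-root'."""
    prefix = PREFIXES.get(n)
    return f"{prefix}{root}" if prefix else f"{n}-{root}"


names = [fancy_name(n) for n in (1, 2, 3, 4, 6, 13)]
truck = fancy_name(18, root="cycle")
```

So 6 becomes a sexagram, 13 a triskaidecagram, and the truck an octakaidecacycle, while 4 stays a humble 4-gram until someone adds its prefix to the table.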

Using Latin and Greek roots to talk about ngrams

The Wikipedia article on numeral prefixes is pretty fun, but if you like this kind of thinking, you should check out Stephen Chrisomalis’ page on Numerical Adjectives. My favorite excerpt:

One of the most peculiar numeral words in English is “zenzizenzizenzic”, which means “the eighth power of a number”, as in “The zenzizenzizenzic of 2 is 256”. It was used only once in English, in Robert Recorde’s The Whetstone of Wit (1557). It is derived from an equally-obsolete “zenzic”, which referred to the square of a number. It is the only English word with six Zs, and is thus, if I may be allowed to coin a term, hexazetic.

– Tyler Schnoebelen (@TSchnoebelen)
