Social media more complex than great literature

11 Apr

{Originally written for Idibon.com}

The Washington Post recently published an article about online/offline reading differences. Are our reading abilities changing?

“They cannot read ‘Middlemarch.’ They cannot read William James or Henry James,” Wolf said. “I can’t tell you how many people have written to me about this phenomenon. The students no longer will or are perhaps incapable of dealing with the convoluted syntax and construction of George Eliot and Henry James.”

So this post tries to answer the question:

Is social media harder or easier to read than Henry James and George Eliot?

Now to refresh your memory, here’s the opening of George Eliot’s Middlemarch:

Who that cares much to know the history of man, and how the mysterious mixture behaves under the varying experiments of Time, has not dwelt, at least briefly, on the life of Saint Theresa, has not smiled with some gentleness at the thought of the little girl walking forth one morning hand-in-hand with her still smaller brother, to go and seek martyrdom in the country of the Moors?

I mean, just yesterday.

Syntax is dead and also very interesting

A dirty secret of computational linguistics is similar to the WaPo article: “syntax is dead”. Except what computational linguists mean isn’t that syntax is really dead but that we avoid actual syntactic parsing in our machine learning models, getting the same information extraction accuracy with much simpler features, like sequences of words, clusters of words and phrases, and their relative positions in sentences.

But that doesn’t mean there isn’t a lot of insight to gain from thinking syntactically. I’m not going to directly quote the opening of Henry James’ The Ambassadors but let me recommend Ian Watt’s 1960 essay about its first paragraph. The Ambassadors has a boring beginning, which even Watt grants. But under his careful explication you come alive to the subtle humor and the way the author has knitted together ambivalences and bewilderment. Importantly, Watt mentions the difficulty of Henry James three times. That’s a literary critic, professor of English at Berkeley (later Stanford) saying James was hard. In 1960, long before the Galaxy, the iPad, and the Kindle.

One of the reasons for the special demand James’s fictional prose makes on our attention is surely that there are always at least three levels of development—all of them subjective: the characters’ awareness of events: the narrator’s seeing of them; and our own trailing perception of the relation between these two.

Of course students today have difficulty with Henry James. People have always had difficulty with Henry James.

Watt’s essay is famous in literature circles because it is such a careful textual analysis. From a linguist’s perspective, it is rich with hypotheses of how language works. Watt’s main points are that James is being humorous as he introduces his protagonist but also compassionate. Here are the linguistic resources he mentions:

  • Sentence lengths and how they change
  • Delayed specification of referents
  • A preference for abstraction: non-transitive verbs, abstract nouns
  • An enormous use of that (see also Who is the Sarah Palin of the Canterbury Tales?)
  • Variation to avoid piling up personal pronouns
  • Lots of negatives and near-negatives
  • Odd placement of words
  • A density of parentheticals (the last sentence of the first paragraph, in particular)

In general, if someone says their system can automatically detect irony or sarcasm, you should grab your wallet and run. But Watt has an interesting observation (in a similar spirit to Maynard Mack in 1948): “The application of abstract diction to particular persons always tends towards irony, because it imposes a dual way of looking at them: few of us can survive being presented as general representatives of humanity.” As he later develops, the classic posture of irony is to know something about someone that is more than they know about themselves.

In general, this suggests the importance of tracking “distancing devices” in language, or more generally, the way people use language to position themselves, their topics, and their audiences.

Tweets are harder to read than George Eliot

Syntactic parsing is a good way of measuring complexity, but it’s difficult to make accurate and it is especially unreliable for tweets (the top compling parsers are listed here; for Twitter, the best resource is CMU’s Twitter part-of-speech tagger).

Psycholinguists have long known that long strings of long words are hard for people to read. So rather than operationalize “difficult syntax”, I’ll take an easier route and use the Flesch Reading Ease metric of readability. Idibon also has other metrics of readability, standard ones and ones that take into consideration word frequency since common words are easier to read than uncommon ones, and some machine-learning driven readability metrics of our own. But for this post, we’ll stick to Flesch so that you can repeat our experiment at home.

I picked out the 10 works of fiction that are the most popular on Project Gutenberg right now, then added 5 of the most popular works by Henry James and 5 by George Eliot. I also grabbed 400,000 random tweets and segmented them into sentences (a tweet lacking any punctuation was considered a full sentence).
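For readers repeating the experiment at home, here is a minimal sketch of that segmentation step. The splitting rule (break after a run of ., ! or ? followed by whitespace) and the function name split_tweet are assumptions for illustration, not the exact procedure used for this post.

```python
import re
from typing import List

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def split_tweet(tweet: str) -> List[str]:
    """Split a tweet into sentences; an unpunctuated tweet is one sentence."""
    parts = (p.strip() for p in _SENTENCE_BREAK.split(tweet.strip()))
    return [p for p in parts if p]

print(split_tweet("124 was spiteful. Full of a baby's venom."))
print(split_tweet("no punctuation at all here"))
```

A tweet with no sentence-final punctuation comes back as a single item, which matches the rule described above.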

This project also had me reflecting on great beginnings, like Toni Morrison’s beautiful (go read it now) Beloved:

124 was spiteful. Full of a baby’s venom.

How do you make a great beginning? “Make the subject of the sentence an obscure sequence of numbers to get the reader’s attention. In case that doesn’t work, follow up with a terrifying, baby-related metaphor.” (More advice here.)

To look at beginnings, I grabbed the opening sentences from 837 different novels. Note that I was strict: even though a lot of beginnings require an interplay of first and second sentences, by beginnings I really mean one sentence and one only. A more comprehensive study would look at the relationship between lengths, as Watt mentioned (see above).

Back to algorithms. It’s important to notice that a one-word sentence like “Impossible!” gets a terrible Reading Ease score because it’s got so many syllables in such a small space. “Sure.” is easier to read than “Impossible!” but it doesn’t strike me as something that ought to really disadvantage an author that much. So what I’ll report are buckets of word counts: Flesch Reading Ease scores for medium sentences (4-8 words) and for long ones (9+ words). The infographic also shows the percentage of short sentences (1-3 words) and hard sentences (Flesch Reading Ease less than 50).
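Here is a rough sketch of the scoring and bucketing described above. The formula is the standard published Flesch Reading Ease formula. The syllable counter is a crude vowel-group heuristic and the word tokenizer is a simple regular expression; both are assumptions for illustration and will not exactly match the numbers in this post.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, discount a silent final 'e'."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return 0
    count = len(re.findall(r"[aeiouy]+", letters))
    # Heuristic (assumption): a final 'e' is silent unless the word ends in 'le'.
    if count > 1 and letters.endswith("e") and not letters.endswith("le"):
        count -= 1
    return max(count, 1)

def flesch_reading_ease(sentence: str):
    """Flesch Reading Ease for a single sentence, or None if it has no words."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    # 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word);
    # with one sentence, words per sentence is simply len(words).
    return 206.835 - 1.015 * len(words) - 84.6 * syllables / len(words)

def length_bucket(sentence: str) -> str:
    """Bucket by word count: short (up to 3), medium (4-8), long (9+)."""
    n = len(re.findall(r"[A-Za-z']+", sentence))
    if n <= 3:
        return "short"
    if n <= 8:
        return "medium"
    return "long"

print(round(flesch_reading_ease("Impossible!"), 2))
print(round(flesch_reading_ease("Sure."), 2))
```

With this heuristic, “Impossible!” (one word, four syllables) scores about -132.6 and “Sure.” about 121.2, which is why one-word sentences are kept in their own bucket. A tokenizer like this one also treats a hashtag as a single long, many-syllable word.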

Among the easiest to read authors are the Brothers Grimm, Henry James, Mark Twain, Lewis Carroll, and Sir Arthur Conan Doyle. But there’s some variation there: James’ Wings of the Dove has very easy to read medium-length sentences but its long sentences in volume 1—but not volume 2—conjoin and detach phrases for 100 words after they begin (search for She wore her “handsome” felt hat or With which he had it again all from her here).

The hardest to read are Mary Shelley, Victor Hugo, Jane Austen, Franz Kafka (David Wyllie translation), and George Eliot.

The sentences with the lowest Reading Ease scores generally go to Victor Hugo (Isabel F. Hapgood translation), who has a number of 300+ word sentences. Those are so rough that the Reading Ease scores for them are negative. This isn’t just a French or French-in-translation thing: even Mark Twain has some 200+ word sentences in Huckleberry Finn.

And tweets? Tweets belong somewhere in between. In terms of long sentences (9+ words), tweets are super-easy to read. That’s partly because if you’ve managed to fit 9 words within Twitter’s 140-character limit, odds are that each word is short.

But in terms of 4-8 word sentences, tweets are among the hardest things to read, sandwiched between Kafka and Austen. Keep in mind that #hashtagsareconsideredtoughtoread. This is basically true whether we use tweets-as-they-are or if we only consider tweets that don’t have @’s or links since those tend to be longer strings that sort-of-shouldn’t-count (the subset of tweets that don’t use @ and don’t have links is about 100,000 tweets big).

But let’s look at how often you’ll come across a difficult sentence in these things. I’ll define a sentence as difficult if it has a score of 50 or lower. In that case, we can see that 34% of Frankenstein’s sentences are difficult, mainly because there are some really long run-on sentences. So you’ll encounter tough sentences most frequently in Frankenstein and second most frequently in Pride and Prejudice, where 31% of the sentences are hard.
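The percentage itself is simple to compute once every sentence has a score. A minimal sketch, assuming the scores are already in a list; the function name share_difficult and the sample scores are made up for illustration:

```python
from typing import Sequence

def share_difficult(scores: Sequence[float], threshold: float = 50.0) -> float:
    """Percentage of sentences whose Reading Ease score is at or below the threshold."""
    if not scores:
        return 0.0
    hard = sum(1 for s in scores if s <= threshold)
    return 100.0 * hard / len(scores)

# Hypothetical scores for four sentences (not real data from any book).
print(share_difficult([121.2, 50.0, -12.4, 83.0]))
```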

If you read through a bunch of tweets (ignoring the ones with @’s and links, because those make tweets look especially difficult to the algorithm), then about 11% of the tweets you encounter are difficult. That’s comparable to The Adventures of Sherlock Holmes (11%) and a bit harder than Alice in Wonderland (9%) and Huckleberry Finn (8%). Note that only 14% of the sentences in Henry James’ The Ambassadors are tough to read. For what it’s worth, James’ average percentage-of-hardness is 16% across the five works I looked at. George Eliot’s is much higher: 25% of her sentences are hard.

As for the opening lines, of the 837 first sentences I looked at, about 18% are tough (for English majors: the opening sentences of authors that Harold Bloom considers canonical are especially long and complex).

Readability isn’t just a measure of how hard or easy something is for cognitive processing; it also tells us something about author and audience. Check out Dan Jurafsky’s work showing how readability distinguishes the language on potato chip packaging (fancy potato chips have fancier sentences).

Final words

Some authors are tough to read. And they have always been tough to read. The process of reading a book or a Twitter stream is a process of familiarization. So difficult books—and Twitter—can actually teach you how to read them as you go along.

Struggling towards understanding is not necessarily a bad thing; it can be beautiful and enriching to find one’s way through a dark wood. But there are times when accessibility is crucial—think about healthcare issues or that complicated email you fretted over last week.

In addition to sentence/word-based metrics for readability, Idibon can also report frequency-based measures, since psycholinguists have decades of evidence that more frequent words are easier to process than rare words. Our clients have used readability metrics to automatically categorize comments on websites, and also as a feature to predict the helpfulness of comments (and other kinds of business-specific needs).

Henry James despised sentences that were a “mere seated mass of information”. Without syntax, words would just sit there like lumps. It’s syntax that gives them backbone. There’s meaning in the way they come together.

– Tyler Schnoebelen (@TSchnoebelen)
