Over 3k comments with sentiment coding

15 Nov
Just found this page (http://www.cyberemotions.eu/data.html) and thought I’d pass it along. If you go to the website, you can sign up to get access to their collection of:
  • BBC News forum posts: 2,594,745 comments from selected BBC News forums, plus > 1,000 human-classified sentiment strengths, each with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
  • Digg post comments: 1,646,153 comments on Digg posts (typically highlighting news or technology stories), plus > 1,000 human-classified sentiment strengths, each with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
  • MySpace (social network site) comments: six sets of systematic samples (3 for the US and 3 for the UK) of all comments exchanged between pairs of friends (about 350 pairs for each UK sample and about 3,500 pairs for each US sample) from a total of > 100,000 members, plus > 1,000 human-classified sentiment strengths, each with a positive strength of 1-5 and a negative strength of 1-5. The classification is the average of three human classifiers.
Here are some examples of their classifications (although I think they just give you the average for each sentence):
  • hey witch wat cha been up too (scores: +ve: 2,3,1; -ve: 2,2,2)
  • omg my son has the same b-day as you lol (scores: +ve: 4,3,1; -ve: 1,1,1)
  • HEY U HAVE TWO FRIENDS!! (scores: +ve: 2,3,2; -ve: 1,1,1)
  • What’s up with that boy Carson? (scores: +ve: 1,1,1; -ve: 3,2,1)
Here’s the annotator agreement discussion and table for the MySpace comments.
Previous emotion-judgement/annotation tasks have obtained higher inter-coder scores, but without strength measures and therefore with fewer categories (e.g., Wiebe et al., 2005). Moreover, one previous paper noted that inter-coder agreement was higher on longer (blog) texts (Gill, Gergle, French, & Oberlander, 2008), suggesting that obtaining agreement on the short texts here would be difficult. The appropriate inter-coder reliability statistic for this kind of data, with multiple coders and varying differences between categories, is Krippendorff’s α (Artstein & Poesio, 2008; Krippendorff, 2004). Using the numerical difference in emotion score as weights, the three-coder α values were 0.5743 for positive and 0.5634 for negative sentiment. These values are high enough to indicate broad agreement between the coders, but not high enough (e.g., < 0.67, although precise limits are not applicable to Krippendorff’s α with weights) to suggest that the coders are consistently measuring a clear underlying construct. Nevertheless, using the average of the coders as the gold standard still seems a reasonable way to get sentiment strength estimates.
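As an aside, if you ever want to reproduce this kind of figure from the raw scores, here is a minimal Python sketch of weighted Krippendorff’s α for three (or more) coders. It uses the standard interval metric (squared score differences) as the weighting; the text above says “numerical difference” weights, which may instead mean absolute differences, in which case you would swap the delta line accordingly. The function name, the coders-by-comments array layout, and the example scores are my own, not anything from the CyberEmotions release.

```python
import itertools
import numpy as np

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha with interval (squared-difference) weights.

    ratings: array of shape (n_coders, n_comments); use np.nan for missing scores.
    Returns a value in (-1, 1]; 1.0 means perfect agreement.
    """
    ratings = np.asarray(ratings, dtype=float)
    values = np.unique(ratings[~np.isnan(ratings)])         # e.g. the scores 1..5
    index = {v: i for i, v in enumerate(values)}
    coincidence = np.zeros((len(values), len(values)))      # coincidence matrix

    for comment in ratings.T:                               # one column per comment
        scores = comment[~np.isnan(comment)]
        m = len(scores)
        if m < 2:
            continue                                        # nothing pairable
        for a, b in itertools.permutations(scores, 2):      # ordered coder pairs
            coincidence[index[a], index[b]] += 1.0 / (m - 1)

    totals = coincidence.sum(axis=1)                        # value marginals
    n = totals.sum()                                        # total pairable scores
    # Interval weights: squared score differences. Use np.abs(...) instead of
    # squaring if "numerical difference" means linear-difference weights.
    delta = (values[:, None] - values[None, :]) ** 2
    d_observed = (coincidence * delta).sum() / n
    d_expected = (np.outer(totals, totals) * delta).sum() / (n * (n - 1))
    return 1.0 - d_observed / d_expected

# Made-up example: three coders' positive-strength scores for five comments.
scores = np.array([[2, 4, 2, 1, 3],
                   [3, 3, 3, 1, 3],
                   [1, 1, 2, 1, 2]], dtype=float)
print(krippendorff_alpha_interval(scores))
```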

 

Table 1. Level of agreement between coders for the 1,041 evaluation comments (exact agreement, % of agreements within one class, mean percentage error, and Pearson correlation).

Comparison      +ve exact   +ve +/- 1 class   +ve mean % diff.   +ve corr   -ve exact   -ve +/- 1 class   -ve mean % diff.   -ve corr
Coder 1 vs. 2   51.0%       94.3%             .256               .564       67.3%       94.2%             .208               .643
Coder 1 vs. 3   55.7%       97.8%             .216               .677       76.3%       95.8%             .149               .664
Coder 2 vs. 3   61.4%       95.2%             .199               .682       68.2%       93.6%             .206               .639
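The per-pair figures in Table 1 are also straightforward to recompute from two coders’ score vectors. Here is a rough sketch (the function and variable names are mine); I have left out the “mean % diff.” column because the page doesn’t spell out exactly how that percentage is normalised.

```python
import numpy as np

def pairwise_agreement(coder_a, coder_b):
    """Exact agreement, agreement within one class, and Pearson correlation
    for two coders' 1-5 sentiment strength scores on the same comments."""
    a = np.asarray(coder_a, dtype=float)
    b = np.asarray(coder_b, dtype=float)
    diff = np.abs(a - b)
    return {
        "exact": float(np.mean(diff == 0)),           # e.g. 51.0% for coders 1 vs. 2 (+ve)
        "within_1_class": float(np.mean(diff <= 1)),  # e.g. 94.3% for coders 1 vs. 2 (+ve)
        "pearson_r": float(np.corrcoef(a, b)[0, 1]),
    }

# Hypothetical usage: coder1_pos and coder2_pos would be the two coders'
# positive-strength scores for the 1,041 evaluation comments.
# pairwise_agreement(coder1_pos, coder2_pos)
```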
Now, I’m not a huge fan of objective/subjective distinctions (see, for example, my review of the computational linguistics work on emotion). But positive/negative polarity and intensity do seem to be real, if incomplete, dimensions, and this might be a useful set of data.