The dirty hands of data scientists

2 Aug


Welcome to this post! Now go read Harlan Harris, Sean Patrick Murphy, and Marck Vaisman’s 40-page book, Analyzing the Analyzers.

The goal of Analyzing the Analyzers is to reduce miscommunication about what is meant by “data scientist”. Their results come from about 250 survey responses, in which they asked data scientists about their backgrounds, their tools, and how they think of themselves. They come up with four types of data scientist:

data businessperson: primarily a leader, businessperson, and/or entrepreneur
data creative: a jack of all trades, artist, and/or hacker
data developer: a developer and/or engineer
data researcher: a researcher, scientist, and/or statistician

In terms of skills, Harris and co. asked about business (e.g., product development, budgeting), machine learning/big data (e.g., NoSQL, text mining, JSON, SVMs, clustering, Hadoop), math/operations research (e.g., optimization, graphical models, Bayesian/Monte-Carlo stats, CS theory), programming (e.g., sysadmin, Java, C++), and statistics (e.g., visualization, time-series analysis, surveys, GIS, R).


Generally, I think of mosaic plots as plots-against-humanity, but theirs is clear and useful. In particular, I like how it helps you reflect on what kind of data scientist you are (or want to become). Their framework also makes it possible to assess whether you are in the right organization. One of my favorite definitions of integrity:

Integrity: integrating who you want to be with what you do

Does your organization support the kind of data scientist you want to be? If not, can you get it to shift or do you need to look elsewhere? These are hard decisions but market forces are on the data scientist’s side. 40+ hours is a lot of time to spend on anything that doesn’t make you who you want to be.

On that note, three of our favorite data scientists announced they are starting new jobs this week: Hilary Mason is joining Accel Partners, DJ Patil is joining RelateIQ, and Monica Rogati is joining Jawbone. We wish them the best of integrity!

One of the personas (okay, personae) that Harris et al. develop is Binita, a Director of Analytics. She’s the representative of the data businessperson. It’s obvious how deep in the data the other three types of data scientists are. So what stands out for me is that even the data businessperson “really likes getting her hands dirty, diving into data sets when she has time”.

This is, I think, a defining characteristic of a data scientist. And maybe we are all increasingly likely to get our hands dirty:

[Figure: “Get x’s hands dirty” from 1900–2000 in the Google Ngram Corpus]

But maybe we’re not: see our post on going beyond raw word counts to track trends.

So what exactly does it mean to get one’s hands dirty? There are at least two models here: the gardener or Sartre. In the Sartrean sense, getting dirty hands means doing morally iffy stuff in order to ensure that morally proper goals can be achieved later on. (You can also think of this in terms of Machiavelli, Max Weber, and Michael Walzer.) One big issue in data science has always been—and always will be—ethics. Where do the data come from? How are the analyses used? Addressing the ethics of data science deserves a big post of its own.

When most data scientists talk about enjoying getting their hands dirty, we are not, of course, saying, “I just love it when I get to achieve positive results or avoid disasters by violating the deepest constraints of morality.” We’re usually imagining something like gardening. Or if that sounds too retired, the image might be a kid happily wallowing in the mud. These images capture, in turn, care and curiosity, planning and playing. These are also ways of being a data scientist.

But let’s connect gardening and Sartre. The data we analyze are social in nature. This may be obvious if you’re analyzing Twitter or Facebook, but it’s also true if you’re monitoring particles or planets. My own focus is on data created by people (language data), but even measurements of tiny and astronomical things were conceived of and implemented by people trying to do something. And the analyses are also part of a social system.

This is one of the reasons I think social scientists are a crucial part of data science teams. Understanding the data requires understanding where it comes from and how it is getting used. (Check out Steve Miller’s post on computational social scientists.)

But having someone who can get a handle on the meaning the data points were imbued with when they were created has to be balanced with some real skills, like the weeding of hypotheses. As a reference point, consider the claim that women wear red or pink shirts more at peak fertility.

[Figure from Beall and Tracy (2013), showing the percentage of women at high conception risk in two different samples. Be skeptical of these findings.]

You can read Andrew Gelman’s critique of Beall and Tracy (2013) in Slate, but don’t miss their response and his-response-to-their-response in his blog. The fundamental critique is one of too many “researcher degrees of freedom”.

The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that the observed pattern in the data would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.
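It’s easy to check Gelman’s arithmetic here. A minimal sketch in Python (my illustration, not from the original post): if all 20 comparisons are run on data where nothing is really going on, each p-value is uniform on [0, 1], so the chance of at least one “significant” result is 1 − 0.95²⁰ ≈ 64%.

```python
import random

random.seed(42)

ALPHA = 0.05
N_TESTS = 20
N_SIMULATIONS = 10_000

# Analytic probability of at least one false positive across 20
# independent tests when the null hypothesis is true for all of them.
analytic = 1 - (1 - ALPHA) ** N_TESTS  # about 0.64

# Monte Carlo check: under the null, each test's p-value is uniform
# on [0, 1], so "p < ALPHA" happens by chance with probability ALPHA.
hits = 0
for _ in range(N_SIMULATIONS):
    if any(random.random() < ALPHA for _ in range(N_TESTS)):
        hits += 1

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {hits / N_SIMULATIONS:.3f}")
```

Researcher degrees of freedom are worse than this toy model suggests, because the comparisons are often implicit and chosen after seeing the data, but even the explicit version makes a lone p < 0.05 unimpressive.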

Skepticism is a crucial tool in Gelman’s toolkit. That comes not just from asking statistical questions but from asking questions about the questions we’re asking. And that isn’t just about analyzing the wording of survey items. It is also part of thinking through why we’re asking the questions we’re asking. The research question—do women give off visual cues to ovulation—has a lot to do with various ideologies of gender. Consider what it means. It means saying that underlying my sisters’ choice in clothing is a desire (culturally and/or evolutionarily driven) to be attractive to men. Maybe they just want to look nice for themselves? Maybe they just want to be comfortable and their fuzziest clothing happens to be red?

I pick up the phone. One of my sisters is wearing grey today. “I don’t know where I am in my cycle.” My other sister is ready to have kids. “I don’t want to think about it. Yes. Maybe. I think so. I’m leaving for Brazil tomorrow morning so I don’t want to know.” She’s in a white t-shirt. “I was wearing an orange-ish sweater but it got hot so I took it off. Does that count?”

Whether we are in academic institutions or enterprises, we like finding patterns in vast, complicated tangles of data. But does what we count, count? Ultimately, we want our analyses and our actions to add up, not to something statistically significant, but to something meaningful.

– Tyler Schnoebelen (@TSchnoebelen)
