“Big data” bandwagoneers may remember the three Vs of big data: volume, variety, and velocity (sometimes joined by veracity or variability). These concerns are real, though if you’re not Google, Amazon, or the NSA, your data is probably not as big as you think it is.
Data “science”, though, is a bigger question than working with big data. Sometimes the right answer is working with relatively small but critically important data, like scanning police reports for evidence of bias or corruption, or using tools that dig through the medical or citation literature to find the ten papers and three disciplines that are related to your uncle’s recent diagnosis. I want to propose my own three wh’s of data science: what, how and why.
These three are the big questions that any data scientist has to work with, and it’s my experience that most data teams overlook one or more of them, even as they may have deep expertise in another. I’ll say a little more about each of these below.
What data are you working with? This encompasses both variety and volume, but an understanding of the variety of data sources that are available usually requires substantial expertise from someone in the actual application domain. Understanding the data (in particular, understanding where it does not have coverage; that is, knowing the variety you’re missing) requires a lot of end-user conversation with the data providers. This bleeds over into how, especially when the data scientists are trying to understand the ways in which the collection process corrupted or biased the examples coming in.
Working with data irregularities and corruption — e.g. normalization and other headaches of ETL, usually swept into the variability bin — is only the beginning of how, which also includes the question of velocity and veracity. Different analytic choices can constrain what you can learn from data, and many of those choices bias the answers or even eliminate some of the questions that can be asked: A/B tests, for example, can learn pairwise preferences, but cannot usually be used to compare radically different designs.
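To make the A/B limitation concrete, here is a minimal sketch of the arithmetic behind a typical A/B comparison: a pooled two-proportion z-test. The function name and the conversion counts are hypothetical, invented purely for illustration. The point is what the machinery takes as input: two variants and one shared success metric. It answers “is B better than A on this metric?” and has nothing to say about a radically different design that would change what counts as success.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Hypothetical helper for illustration: a pooled two-proportion z-test,
    one common way (not the only way) to score an A/B test.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 120 of 2400 visitors converted on A, 156 of 2400 on B.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Everything here presumes that A and B can be scored on the same yardstick, which is exactly the analytic choice that rules out some questions before any data arrives.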
I’ve worked with teams that insist that everything is a graph, and I’ve been the one insisting that every data point be represented as part of a graph, but some analyses really need something different. I’ve worked with teams that insist that every operation is a map-reduce operation, and teams that insist every operation is a neural net forward pass; each of these assumptions provides amazing operational flexibility in some ways and is astonishingly hog-tying in others.
Different apices (rational, deep, theoretical) of the Eisner Simplex for machine learning will offer different machinery (how) for learning, but to choose among them here we bleed over into why.
It’s often a challenge to ask data scientists why they want to make a certain analysis. One all-too-common answer is “because it’s there” — this attitude is common enough — and so insufficient — that it can basically be automated for comedy’s sake. But outside of the click-generation business (advertising and sales), it’s often quite difficult to identify why the data scientist is doing a particular analysis or providing a particular tool. It’s here that the “public data scientist” bloggers and public figures have lots of work to do.
One often-overlooked role of the scientist or data wrangler is to design (and redesign and redesign) machinery that quantifies the quality of the analyses that the wrangler is providing, and this loops back around into a deep understanding of what the data contains and how it was collected.
Failing at one (or more) wh
I have some opinions about the relationship between data what, how, and why — and who’s succeeding where [to add in the remaining wh- terms].
Big data: Tech popularizers are often heavily focused on how, to the detriment of both what and why, usually trying to sell a particular framework that worked well for $BigEngineeringOutfit. I’ve seen too many “implement word count in fourteen lines” examples to count without a massive army of parallel monkeys, and knowing that I can do word-count doesn’t tell me if your framework will let me solve the data problems I’m dealing with.
Human relationships: On the other hand, nifty problems are often missing the right set of useful how answers beyond parallelism. Does your nifty parallel framework correctly normalize addresses? How about identifying proper names? How about identifying misogynists among your prospective friends or lovers? If you, the nifty parallel processor maintainer, said no to all of these, that’s probably good (separation of concerns is wise; parallelism and address processing should be orthogonal), but who is talking about advances in those problems? There are academic efforts to detect depression in speech, but how would you go about applying this in the clinic or the home?
Medicine and epidemiology are obviously areas that have lots of good why motivations: only the most committed misanthrope thinks preventing and treating disease and illness is a bad idea. If why isn’t all that clear, though, then (like Theranos) the how can be corrupting on its own: “fake it ’til you make it” is not an acceptable analytic practice in blood testing, but it might be for ad ranking. Elsewhere in medicine, there is lots of data (what) available to the right practitioners, but its variety and variability are enormous obstacles to how that data can be processed in volume.
Corruption: The Panama Papers investigation provides an example of a sudden what becoming available, and a round-the-clock search of all the data — largely by hand. This remarkable achievement was accomplished with a strongly shared why — to expose the corruption of this particular global elite. In the absence of other useful “data science” tools, they still constructed a useful how by networking journalists internationally to share results.
Civic transparency: The Open Seattle collaborations with data.seattle.gov seem to be unusually strong in what, have lots of people to ask about how, and are searching around for a why. (It’s nice to have a scavenger hunt to find all the public parks, but it’s not exactly a killer app.) I’m interested and excited to see what happens with that.
The arts: art might be the only field that can get away with knowing what and how without knowing why; then, of course, the tinkerer of data “science” is actually an artist with new tools.
 or veganism, as in:
Q: “how can you tell if someone is a [vegan] big data programmer?”
A: “wait a minute; they’ll tell you.”
Sharp-eyed readers may note that “how” does not begin with wh-, but this is a sleight-of-hand I’m stealing from linguistics.