This is part of an open-ended series of marginalia to Donoho’s 2015 paper, “50 Years of Data Science”.
Many aspects of Donoho’s 2015 “greater data science” can support scientists of other stripes, and not just because “data scientist is like food cook”: if data science is a thing after all, then it has specific expertise that applies to shared problems across domains. I have been thinking a lot about how the outsider-ish nature of “data science” lets it provide supporting analysis in a specific domain-tied (“wet-lab”) science.
This is not to dismiss the data science that’s already happening in the wet lab, but to acknowledge that the expertise of the data scientist is often complementary to the domain expertise of her wet-lab colleague.
Here I lay out three classes of skills that I’ve seen in “data scientists” (more rarely, but still sometimes, in software engineers, or in target-domain experts: these people might be called the “accidental data scientists”, if it’s not circular).
“Direct” data science
Donoho 2015 includes six divisions of “greater data science”. Drawing on them, a data scientist can offer a wet-lab colleague:
- methodological review on data collection and transformation
- representational review ensuring that — where possible — the best standards for data representation are available; this is a sort of future-proofing and also feeds into cross-methodological analyses (below)
- statistical methods review on core and peripheral models and analyses
- visualization and presentation design and review, to support exploration of input data and post-analysis data
- cross-methodological analyses, which are much easier to adapt when data representations and transformations conform to agreed-upon standards
Coping with “big” data
- adaptation of methods for large-scale data cross-cuts most of the above — understanding how to adapt analytic methods to “embarrassingly parallel” architectures
- refusing to adapt methods for large-scale data when, for example, the data really aren’t as large as all that. Remember, many analyses can be run on a single machine with a few thousand dollars’ worth of RAM and disk, rather than requiring a compute cluster at orders of magnitude more expense. (Of course, projects like Apache Beam aim to bake in the ability to scale down, but this is by no means mature.)
- pipeline audit capacity — visualization and other insight into data at intermediate stages of processing is more important the larger the scale of the data
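To make the “embarrassingly parallel” point concrete, here is a minimal sketch in Python. It assumes the analysis can be split into per-chunk work plus an associative merge; the function names, the chunk size and the toy data are all made up for illustration. The same function runs serially on one machine with the built-in `map`, or over a pool of workers when handed `Executor.map`, and it keeps a small per-chunk summary as an audit record of the intermediate stage. The thread pool only stands in for a real process pool or cluster; in CPython it will not speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Per-chunk work: (count, sum, sum of squares). Needs no other chunk."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Merge two partial results. Associative, so merge order is irrelevant."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(values, chunk_size=1000, mapper=map):
    """Population mean and variance, computed chunk by chunk.

    `mapper` is any map-like callable (an assumption of this sketch):
    the built-in `map` runs serially, `Executor.map` spreads chunks
    over workers.
    """
    if not values:
        raise ValueError("values must be non-empty")
    partials = list(mapper(partial_stats, chunked(values, chunk_size)))
    # Audit record for the intermediate stage: one small summary per chunk.
    audit = [{"chunk": i, "n": n, "sum": s}
             for i, (n, s, _) in enumerate(partials)]
    n, total, total_sq = reduce(combine, partials)
    mean = total / n
    # Sum-of-squares formula: simple, but can lose precision on floats.
    variance = total_sq / n - mean * mean
    return mean, variance, audit


data = list(range(10))  # toy data, purely illustrative

# Scaled down: plain serial execution on one machine.
serial = mean_and_variance(data, chunk_size=3)

# Scaled out: identical code, different mapper.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = mean_and_variance(data, chunk_size=3, mapper=pool.map)
```

Making the mapper a parameter is one way to honor both bullets at once: the method is written for parallel execution, but nothing forces a cluster on data that fits on a laptop.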
Scientific honesty and client relationships
Data scientists are uniquely well placed to improve the human quality of the research done by the “wet-lab” scientists they support. By focusing on the data science in particular, they can:
- identify publication bias, or other temptations like p-hacking, even if inadvertent (these may also be part of the statistical methods review above)
- support good-faith re-analysis when mistakes are discovered in the upstream data, the pipelines or supporting packages: if you’re doing all the software work above, re-running should be easy
- act as a “subjects’ ombuds[wo]man” by considering (e.g.) the privacy and reward trade-offs in the analytics workflow and the risks of data leakage
- facilitate communication within and between labs
- find ways to automate the boring and mechanical parts of the data pipeline process
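One concrete guard against inadvertent p-hacking, mentioned in the list above, is to correct for the number of tests that were actually run. The following is a minimal sketch of the Benjamini-Hochberg step-up procedure, a standard false-discovery-rate correction chosen here purely as an illustration. The function name and the p-values are hypothetical, not taken from any real analysis.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up: sort the m p-values, find the largest
    rank k with p_(k) <= (k / m) * alpha, and reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank  # keep the LARGEST rank that passes
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from ten tests run on the same data set.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [x <= 0.05 for x in p]      # uncorrected: five "findings"
corrected = benjamini_hochberg(p)   # corrected: only the first two survive
```

The gap between `naive` and `corrected` is the kind of thing a supporting data scientist can put in front of a colleague before a paper goes out, without needing any of the domain expertise behind the individual tests.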