Donoho’s “Greater Data Science”, part 0

“50 years of Data Science”. Donoho, David.  2015. [link to downloadable versions]

Donoho’s got a manifesto that ain’t foolin’ around.  I have a lot of thoughts about it, but I’m going to write them up as an open-ended series of marginalia on this remarkable essay.

Data science is a thing after all

I’ve said elsewhere (probably also elsewhere on this blog) that I’m not sure “data science” is a thing: to paraphrase Aldous Huxley, data science has always seemed like “nine suburbs in search of a metropolis”.

But I’m here to bring the good word: Greater Data Science, as Donoho describes it, probably is a thing.

Data science is more than “merely” CS or statistics

The activities of Greater Data Science are classified into 6 divisions:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science

These six divisions are not neatly captured by statistics (as an academic discipline), by computer science (though the union of its machine learning, distributed computation, and database wings covers some of them), *or* by the union of the two, which largely leaves out (1), (6), and, to some degree, (5).  Existing “Data Science” master’s programs tend to cover some of the overlap between Greater Data Science (GDS) and the union of statistics and computer science, and applied data scientists “in the field” (usually in industry) sometimes have fairly deep knowledge of (1) as it applies to their particular subdomain, e.g. geocoding.

Almost nobody covers (6), “Science about Data Science”, with well-informed content and structure, and I’d like to see more data scientists getting involved in all six parts here. Even more important to me is the idea that we share theory and praxis, in all six “activities”, across applications. That idea is familiar to statisticians, but not to computer scientists or to people in applied domains like biostatistics or NLP.

Forward pointers

Future posts in this series will include thoughts about:

  • what is a discipline (and what isn’t — at least not yet)
  • the power of meta-analysis (including surveys of methods) and the analogies to the common task framework
  • the value (and risks) of mentorship and the common task framework.

