Greater data science, part 2.1 – software engineering for scientists

This is part of an open-ended series of marginalia to Donoho’s 2015 paper, 50 Years of Data Science.

In many scientific labs, the skills and knowledge required for the research (e.g. linguistics fieldwork, sociological interview practices, wet-lab biological analysis) are not the same skills involved in software engineering or in data curation and maintenance.

Some scientists thus find themselves the “accidental techie” in their local lab, maybe not even the “accidental data scientist”, but the one doing specific software engineering tasks: you’re the poor schmuck stuck making sure that everybody’s spreadsheets validate, that nobody sorts the data on the wrong field, that you removed the “track changes” history from the proposal before you sent it off to the grant agencies, etc.

Scientific labs of any scale (including academic labs, though they probably don’t have the budgets or the incentives) can really benefit from data science expertise, and especially from software engineering expertise, even (or perhaps especially) when the engineer isn’t an inside-baseball expert in the research domain. I list below a number of places where an experienced software engineer (or data scientist) can make a difference to a field she (or he) doesn’t know well.

Software collaboration practices

Most of these are skills that software engineers have acquired “the hard way”, by first overlooking the value of standardized change logs and backups and then paying for it. Data scientists can support their lab by promoting good software collaboration practices, possibly including teaching and mentoring:

writing under code control is very helpful to collaboration, providing both backups and merge capacity for group work
literate analysis (Jupyter or R Markdown notebooks) is a good way to explain and self-document executable code, and every publication or slide deck that has figures should have a notebook attached
provisioning analysis environments, like containers with full Anaconda-Python installs running Jupyter, with known versions of the critical dependency chains
analysis under code control can take advantage of git history to maintain backups and to collaborate across scientists
code review and/or pair programming should be a part of the scientist’s workflow: if an analysis is complicated, colleagues should review it both for honesty and for mutual growth
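
One small way to get “known versions of the critical dependency chains” into a notebook is to record them in its first cell. The following is a minimal sketch using only the Python standard library; the function name `snapshot_versions` and the package names passed to it are illustrative assumptions, not part of any lab's actual setup.

```python
import sys
from importlib import metadata


def snapshot_versions(packages):
    """Return {name: version} for the interpreter and each requested package.

    A package that is not installed maps to None rather than raising.
    """
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Hypothetical dependency list; substitute whatever the analysis imports.
    for name, version in snapshot_versions(["numpy", "pandas"]).items():
        print(name, version if version is not None else "(not installed)")
```

Committing that output alongside the notebook gives reviewers a record of the environment the figures were actually produced in.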

Data-oriented packaging practices

Software encodes ideas and mechanisms, much like journal articles and conference papers. Referring to a particular journal article, or even a specific software package, often leaves out critical detail for replication. Software engineers have conventions as elaborate as MLA citation style for specifying the particular version (or version range) of software they expect to run.

package shared code: one way to simplify these analysis notebooks is to pull shared code out into a package, a bundle of shared methodology in the programming language of choice.
focus on sharing packages — it’s really best to share packages using the packaging and version-number schemes that make sense for the language in question. Even if only shared within the lab, packages need documentation, tests and usually tutorials and overviews, ideally written by someone who did not already write the code.
use standard packaging and continuous integration: most major software communities (certainly Python, Java, R) have fairly clear packaging standards and version number schemes, and the means to include and execute documentation, data sets, etc.
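
To make the idea of a “version range” concrete, here is a deliberately simplified sketch of checking an installed version against a range. It assumes purely numeric, dot-separated versions; real version schemes (pre-release tags such as `1.0rc1`, for example) are richer, and each language's standard packaging tools handle them properly. The helper names `parse_version` and `in_range` are made up for this illustration.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.4.2' into (1, 4, 2), padding short versions with zeros."""
    parts = [int(piece) for piece in text.strip().split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_range(installed: str, minimum: str, below: str) -> bool:
    """True when minimum <= installed < below, compared number by number."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


if __name__ == "__main__":
    # Numeric comparison treats 1.10.0 as newer than 1.9 ...
    print(in_range("1.10.0", "1.9", "2.0"))
    # ... whereas comparing the raw strings gets the order wrong.
    print("1.10.0" >= "1.9")
```

The second print is the reason version numbers deserve a scheme rather than ad hoc string comparison: as plain text, "1.10.0" sorts before "1.9".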

Code-oriented publication practices

Conversely, academic scientists are often driven (for better or worse) by citation counts as a proxy for shared utility. Data scientists and “wet-lab” scientists alike can (and should) cite a package and its version, in the software engineer’s sense, the same way a careful writer cites an article’s author, title, journal name, year and month of publication, etc.

cite the packages you use academically if possible; analyses should always document the specific versions of the packages you’ve installed
publish shared packages with version information, pointers back to code control, and (when possible and convenient, for h-index and other not-necessarily-wise metrics) citeable releases, e.g. at JOSS (the Journal of Open Source Software)
“public” software (I too hate the term open source) should be a virtue to most scientists: sharing methodologies should be more than a paragraph of natural language description in a biochemistry journal article. This means slightly more than publishing the package “artifact”: it (often) includes publishing the code itself.
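
As one possible shape for a version-pinned citation, the sketch below formats a BibTeX `@misc` entry for a package. Everything in the example call (the package name `labtools`, the version, the author, and the URL) is hypothetical, and the choice of `@misc` with the version in a `note` field is just one common convention, not a standard this post prescribes.

```python
def bibtex_for_package(name: str, version: str, year: int, url: str, author: str) -> str:
    """Format a BibTeX @misc entry that records the exact package version."""
    # Build a citation key such as labtools_0_3_1 from the name and version.
    key = f"{name}_{version}".replace(".", "_").replace("-", "_")
    lines = [
        f"@misc{{{key},",
        f"  author = {{{author}}},",
        f"  title = {{{name}}},",
        f"  year = {{{year}}},",
        f"  note = {{Version {version}}},",
        f"  howpublished = {{\\url{{{url}}}}},",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(bibtex_for_package(
        name="labtools",
        version="0.3.1",
        year=2016,
        url="https://example.org/labtools",
        author="Example Lab",
    ))
```

Generating the entry from the same version string the package ships with keeps the citation and the installed code from drifting apart.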

Posted

June 16, 2016

data science, programming, statistics, work

Jeremy


Comments

6 responses to “Greater data science, part 2.1 – software engineering for scientists”

trochee

June 16, 2016

Improving data science for scientists https://t.co/wh6IXk3u3m

griverorz

June 16, 2016

Greater data science, part 2.1 – software engineering for scientists https://t.co/EESDsq5Ocy via @trochee

trochee

June 16, 2016

RT @griverorz: Greater data science, part 2.1 – software engineering for scientists https://t.co/EESDsq5Ocy via @trochee

Bill McNeill

June 16, 2016

I’ve been calling myself a data scientist for a while without knowing or much caring exactly what that means. But my most recent work has given me a plausible interpretation.

I work on a commercial machine-learning based information retrieval system. In order to determine how good it is, a few other people and I performed a series of experiments comparing our system’s performance against various baselines. It was the typical sort of thing you would do if you were publishing an academic paper about a new machine learning technique.

Now we hadn’t come up with a new technique (we were just doing a competitive analysis of an existing system), but because we had experience in this field we knew how to frame it as a scientific experiment. We knew the standard methodologies, the right graphs to draw. Some of this knowledge was taken from our specific machine learning work, but a lot of it came from a more general understanding of experimental design. We decided that the computer system was complicated enough that it merited us treating it the way scientists would treat a natural phenomenon.

In this case the “data” part of “data science” doesn’t mean much more than “related to computers somehow”, but the “science” part means “applying the scientific method” in a strict and traditional way.

1. Jeremy
  
  June 16, 2016
  
  well, this is interesting. Your work sounds like a close correspondence to what Tukey would have called “Data Analytics”, and specifically what Donoho suggests is the “Common Task Framework”.
  
  Treating a computer system as “complicated enough to be treated as scientists would treat a natural phenomenon” is also a useful class of models (also, those that are resistant to analytic proof).
  
  These fit together into the angle that developing complex systems is closer to agriculture or horticulture than it is to mathematics or physics.
  
  1. Bill McNeill
    
    June 16, 2016
    
    Also you don’t need to be doing data science to see the virtue of the scientific method. The whole “treat the computer like a part of nature” idea is something I came up with twenty years ago when I was working my first job as a software tester at Microsoft.
    