Greater data science, part 2.1 – software engineering for scientists

This is part of an open-ended series of marginalia on Donoho’s 2015 paper, 50 Years of Data Science.

In many scientific labs, the skills and knowledge required for the research (e.g. linguistics fieldwork, sociological interview practices, wet-lab biological analysis) are not the same skills involved in software engineering or in data curation and maintenance.

Some scientists thus find themselves as the “accidental techie” in their local lab — maybe not even as the “accidental data scientist”, but doing specific software engineering tasks: you’re the poor schmuck stuck making sure that everybody’s spreadsheets validate, that nobody sorts the data on the wrong field, that you removed the “track changes” history from the proposal before you sent it off to the grant agencies, and so on.

Scientific labs of any scale (including academic labs, though they probably don’t have the budgets or the incentives) can really benefit from data science expertise, and especially from software engineering expertise, even — or perhaps especially — when the engineer isn’t an inside-baseball expert in the research domain.  I list below a number of places where an experienced software engineer (or data scientist) can make a difference in a field she (or he) doesn’t know well.

Software collaboration practices

Most of these are skills that software engineers have acquired “the hard way” — by living through the consequences of overlooking the value of standardized change logs and backups.  Data scientists can support their lab by promoting good software collaboration practices, possibly including teaching and mentoring:

  • writing under code control is very helpful to collaboration, providing both backups and merge capacity for group work
  • literate analysis is a good way to explain and self-document executable code (Jupyter or R Markdown notebooks), and every publication or slide deck that has figures should have one attached
  • provisioning analysis environments, like containers with full Anaconda-Python installs running Jupyter, with known versions of the critical dependency chains
  • analysis under code control can take advantage of git history to maintain backups and to collaborate across scientists
  • code review and/or pair programming should be a part of the scientist’s workflow: if an analysis is complicated, colleagues should review it both for honesty and for mutual growth
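One provisioning habit from the list above can be sketched in a few lines: record the exact versions of the critical dependency chain at the top of an analysis script or notebook, so a collaborator re-running it knows which environment produced the results. This is a minimal sketch; the function name and the package list are illustrative, and you would substitute your lab’s actual stack.

```python
# Record the exact versions of critical dependencies so that a saved
# notebook documents the environment that produced its figures.
import importlib.metadata
import sys


def report_environment(packages):
    """Map each package name to its installed version (or a clear marker)."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = "NOT INSTALLED"
    return versions


# In a notebook, run this in the first cell and keep it in the saved output.
print(report_environment(["numpy", "pandas"]))
```

Printing this in the first cell, and committing the notebook with its output, gives every reader of the git history a record of the environment alongside the analysis itself.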

Data-oriented packaging practices

Software encodes ideas and mechanisms, much like journal articles and conference papers.  Referring to a particular journal article — or even a specific software package — often leaves out critical detail needed for replication.  Software engineers have citation conventions as elaborate as the MLA’s to specify the particular version (or version range) of the software they expect to run.

  • package shared code: one way to simplify these analysis notebooks is to pull shared code out into a package — a bundle of shared methodology in the programming language of choice
  • focus on sharing packages — it’s really best to share packages using the packaging and version-number schemes that make sense for the language in question. Even if only shared within the lab, packages need documentation, tests, and usually tutorials and overviews, ideally written by someone other than the code’s author
  • use standard packaging and continuous integration — most major software communities (certainly Python, Java, and R) have fairly clear packaging standards and version-number schemes, and means to include and execute documentation, data sets, etc.
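The “pull shared code out to a package” bullet can be sketched concretely. This is a minimal, hypothetical example — the module name, function, and version number are invented for illustration — but it shows the shape: one versioned, tested definition that every notebook imports instead of re-pasting its own copy.

```python
# A sketch of a shared lab module, e.g. labmethods/core.py (name hypothetical).

__version__ = "0.1.0"  # a standard version-number scheme (MAJOR.MINOR.PATCH)


def normalize_counts(counts, total=None):
    """Normalize raw counts to proportions.

    Because this lives in one shared package rather than pasted into each
    notebook, a fix here propagates to every analysis that depends on it.
    """
    if total is None:
        total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize a zero-total sample")
    return [c / total for c in counts]


def test_normalize_counts():
    # The kind of test a package's continuous integration runs on every change.
    assert normalize_counts([1, 1, 2]) == [0.25, 0.25, 0.5]
```

Keeping the test next to the shared code is what makes the continuous-integration bullet above cheap: every change to the shared methodology gets checked automatically before any notebook picks it up.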

Code-oriented publication practices

Conversely, academic scientists are often driven — for better or worse — by citation counts as a proxy for shared utility. Data scientists and “wet-lab” scientists alike can (and should) use the software engineer’s concepts of package and version the same way a careful writer publishes an article’s author, title, journal name, and year and month of publication.

  • cite the packages you use academically where possible; analyses should always document the specific versions of the packages you’ve installed
  • publish shared packages with version information, pointers back to code control, and (when possible and convenient, for h-factor and other not-necessarily-wise metrics) citeable releases, e.g. at JOSS
  • “public” software (I too hate the term open source) should be seen as a virtue by most scientists: sharing methodologies should mean more than a paragraph of natural-language description in a biochemistry journal article. This means slightly more than publishing the package “artifact”; it also (often) includes publishing the code itself
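The parallel between a bibliographic citation and a package-plus-version can be made mechanical. Here is a minimal sketch of a citation formatter; the output format is illustrative, not any journal’s required style, and the example package names are hypothetical.

```python
# Format a one-line software citation that records the exact version,
# mirroring the author/title/year fields of an article citation.

def cite_package(name, version, authors="unknown", year=None):
    """Return a citation line for a software package at a specific version."""
    parts = [authors]
    if year is not None:
        parts.append(str(year))
    parts.append(f"{name} (version {version})")
    parts.append("Software package")
    return ". ".join(parts) + "."


print(cite_package("labmethods", "0.1.0"))
# unknown. labmethods (version 0.1.0). Software package.
```

The point is not the format but the discipline: the version field is as essential to replicating a software-backed analysis as the year and journal are to locating an article.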