Category: statistics

  • Greater data science, part 2.1 – software engineering for scientists

    This is part of an open-ended series of marginalia to Donoho’s 50 Years of Data Science 2015 paper. In many scientific labs, the skills and knowledge required for the research (e.g. linguistics fieldwork, sociological interview practices, wet-lab biological analysis) are not the same skills involved in software engineering or in data curation and maintenance. Some scientists…

  • Greater data science, part 2: data science for scientists

    This is part of an open-ended series of marginalia to Donoho’s 50 Years of Data Science 2015 paper. Many aspects of Donoho’s 2015 “greater data science” can support scientists of other stripes — and not just because “data scientist is like food cook” — if data science is a thing after all, then it has specific expertise that…

  • Greater data science, part 1: the discipline

    This is part of an open-ended series of marginalia to Donoho’s 50 Years of Data Science 2015 paper. Donoho compares “data science” (or “data analysis”, a term he inherits from John Tukey) to statistics in terms of three foundational conditions, quoting Tukey: Let’s call these three core conditions content, structure, and (a means of determining) validity.  Anything with an answer…

  • Donoho’s “Greater Data Science”, part 0

    “50 years of Data Science”. Donoho, David.  2015. [link to downloadable versions] Donoho’s got a manifesto that ain’t foolin’ around.  I have a lot of thoughts about it, but I’m going to write them up as an open-ended series of marginalia on this remarkable essay. Data science is a thing after all I’ve said elsewhere (probably…

  • Visualization libraries in Jupyter, Python, & R

    I’ve become a near-rabid fan of the Jupyter data analysis environment (hello Scott!), and I am deeply impressed by the work that Continuum (and some of my former colleagues at Google) have put into supporting it.  (I share some of these concerns, but that’s a post for another time.) This week I have been teaching myself…

  • Relational data science skills

    Here’s what I see as ideal “data science” leadership. This post is a nod to the classic Conway Venn Diagram, but focused more on relational skills than on specific individual output (much as Tunkelang suggests here). Tooling skills Here, it’s most helpful to be comfortable with the family of “data science” tools that is out there, and be…

  • Three wh-‘s of data science

    “Big data” bandwagoneers may remember the three Vs of big data: volume, variety, and velocity (sometimes joined by veracity or variability[0]).  These concerns are real, though (if you’re not Google, Amazon or the NSA), your data is probably not as big as you think it is. Data “science”, though, is a bigger question than working with big data.  Sometimes…

  • Samyro 0.0.2 – sampling structured inputs

    New version of Samyro (0.0.2) now uploaded to PyPI. The GitHub repo has the details, but I’ll brag about the new features: samyro write accepts a --seed argument, which allows the usual temperature-based decoding *after* the engine has progressed through the given seed. The default seed is now the BOS character, which plays nicely with the structured…

  • Looking for work, 2012 edition

    A short note (implied by my updates on Twitter), just to say: I was laid off last week from my previous employment in an abrupt downsizing — a company pivot, evidently away from the work I like to do.  I’m looking for work now.  Below the jump: what I’m looking for.

  • Norvig says it better

    Those who got fired up about Chomsky’s difficult comments regarding empiricism, including myself, will be gratified to see that Peter Norvig, patron saint of data-driven computational linguistics (inter alia), has released his own comments, along the same lines as mine, only better researched, more broadly researched, more respectful, more thorough, and, well, coming from the keyboard…