I’ve become a near-rabid fan of the Jupyter data analysis environment (hello Scott!), and I am deeply impressed by the work that Continuum (and some of my former colleagues at Google) have put into supporting it. (I do have some concerns about it, but that’s a post for another time.) This week I have been teaching myself enough about R to want a toy environment for hacking on R code, and lo, there’s a blog post about getting Jupyter to run an R backend.
Along the way, I’m starting to look at visualization toolkits that play nicely with Jupyter and a Python/R dataflow compatible with both publication and web-page rendering — so I’ll share some of my own notes here.
My desiderata:
- compatible with Python and R data-cleaning and -crunching upstream
- large data visualization, especially streaming or querying at the back end
- beauty for publication in journals
- interactivity for exploration (and web publication?)
Summary: The best choice will depend on how much you already have written. For new code (at least, for me), I think building directly in bokeh or ggplot(2) is probably the way to go.
bokeh seems to have a fantastic collection of useful examples built in, is designed around the Jupyter interface abstraction (by the Continuum team), and it claims an interface from R and from Python alike (and Scala and Julia, but let’s leave that to the side). It can take plots coming in from matplotlib and seaborn upstream, and plays extremely well with my favorite Python data structure (the pandas DataFrame). It’s almost exactly what I think I’m looking for, but I know that greenfield development is usually a pipe dream, so we have to consider cases where we’re building on existing analyses.
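To make that concrete, here is a minimal sketch of what a bokeh plot looks like inline in a notebook, fed straight from a pandas DataFrame (the column names and tool list are just illustrative, and the exact figure arguments shift a bit between bokeh versions):

```python
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource

output_notebook()  # render inline in the Jupyter notebook instead of to a file

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 7, 2, 5]})
source = ColumnDataSource(df)  # wraps the DataFrame for bokeh glyphs

p = figure(title="toy scatter", tools="pan,wheel_zoom,box_zoom,reset,hover")
p.scatter(x="x", y="y", source=source, size=8)
show(p)
```

The pan/zoom/hover tools come along for free, which is most of what I mean by “interactivity for exploration” above.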
ggplot2 comes from the R community and the remarkable Hadley Wickham and presents a different interface to its phenomenally rich collection of visualizations. It works from inside Python as well (with the yhat/ggplot library) and is, uh, quite impressive in its existing bag of tricks. It’s not particularly pythonic in the yhat implementation, but since you’re likely to be cribbing examples from R code, it’s nice that its interface looks very similar in both languages. I’d like to understand how nicely this plays with the pandas DataFrame, and — similarly — how it plays with feather, Wickham’s most recent collaboration with Wes McKinney on Python/R data interoperability.
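For comparison, here is roughly what the same toy plot looks like through yhat’s Python port (a sketch only; the names deliberately mirror R’s ggplot2, which is the point):

```python
import pandas as pd
from ggplot import ggplot, aes, geom_point, ggtitle  # yhat's ggplot ("ggpy")

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 7, 2, 5]})

p = ggplot(aes(x="x", y="y"), data=df) + geom_point() + ggtitle("toy scatter")
print(p)  # renders through matplotlib; in a notebook, a bare `p` displays inline

# For the hand-off question: the feather-format package writes a frame that R
# can read directly, e.g. feather.write_dataframe(df, "toy.feather")
```

The R version reads almost identically, which is exactly the cribbing-examples-from-R convenience I mean.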
Python legacy
For legacy Python visualizations, converting matplotlib output to bokeh for small data, or rewriting in bokeh entirely as the data scales up, are both happy ways forward. But here are some alternatives:
mpld3 looks like a pretty great way to make sure that your matplotlib plots generated for publication can also look good when presented on the web and in interactive mode during development and analysis. It’s basically a re-skinning of matplotlib into embeddable D3 Javascript. If you’re starting with a Python-to-matplotlib dataflow, using this library to change your inline output from .png to D3 is a nice nearly-drop-in solution. Nearly anything you develop that does inline d3 with this library will probably still generate nice .png files for publication. However, if your workflow is in R, this is essentially unreachable, because matplotlib does not play well with R. Recommendation: easy drop-in to existing Jupyter/Python workflows.
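The drop-in really is about two lines; a sketch, assuming an existing matplotlib figure in the notebook:

```python
import matplotlib.pyplot as plt
import mpld3

mpld3.enable_notebook()  # from here on, inline figures render as D3 instead of .png

fig, ax = plt.subplots()
ax.scatter([1, 2, 3, 4], [4, 7, 2, 5])
ax.set_title("toy scatter")

fig.savefig("toy_scatter.png")  # the same figure still writes a .png for publication
```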
seaborn is a high-level abstraction above matplotlib, but does not render to D3 in an obvious way. It looks like a very pretty collection of plot families, but it is not particularly focused on interactivity. Recommendation: a refinement of the matplotlib workflow. Use for beauty, especially if you have an existing Python matplotlib workflow.
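A sketch of what I mean by “a refinement of the matplotlib workflow”: the seaborn call below still produces an ordinary matplotlib Axes, so the publication path is unchanged (styling calls vary a little by seaborn version):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")  # newer releases spell this sns.set_theme()

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 7, 2, 5]})
ax = sns.regplot(x="x", y="y", data=df, fit_reg=False)  # a plain scatter, seaborn-styled
ax.set_title("toy scatter")
plt.savefig("toy_scatter_seaborn.png")
```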
R legacy
shiny attempts to tackle the same space as the entire Jupyter stack (including interactive visualizations a la bokeh), but within R. I don’t have enough experience with R to feel like this is a good direction for me or anybody I’m working with — there may be some room for pulling in other languages, but I’d rather start my own work in a language (Python) that I feel very comfortable with. Shiny also shares some external hosting problems with plotly (below).
Other APIs
plotly is a nice-looking library, and makes grand claims about offline behavior for R and Python, but I got tripped up by its fairly nice API that… regularly makes calls back to a central web service. As a data scientist with concerns about privacy and security, I do not want my visualizations to be shipping data elsewhere without some very clear arguments about why it would need to do this (and possibly a HIPAA/IRB review). Furthermore, the “stand-alone” part seems like an effort to encourage users to move to the paid APIs on a rough “freemium” model, which doesn’t work out well for researchers (cost-sensitive) or private companies (data-sensitive). Recommendation: fun to play with, but serious privacy concerns.
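For completeness, the offline mode I was reaching for looks roughly like this from Python (a sketch; init_notebook_mode embeds plotly.js in the notebook itself, which is the behavior I want by default rather than as an opt-in):

```python
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode()  # embed plotly.js locally in the notebook output

trace = go.Scatter(x=[1, 2, 3, 4], y=[4, 7, 2, 5], mode="markers")
iplot(go.Figure(data=[trace], layout=go.Layout(title="toy scatter")))
# the lookalike plotly.plotly.iplot, by contrast, posts your data to the plot.ly service
```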