Visualization libraries in Jupyter, Python, & R

I’ve become a near-rabid fan of the Jupyter data analysis environment (hello Scott!), and I am deeply impressed by the work that Continuum (and some of my former colleagues at Google) have put into supporting it.  (I share some of these concerns, but that’s a post for another time.) This week I have been teaching myself enough about R to want a toy environment for hacking on R code, and lo, there’s a blog post about getting Jupyter to run an R backend.

Along the way, I’m starting to look at visualization toolkits that play nicely with Jupyter and a Python/R dataflow compatible with both publication and web-page rendering — so I’ll share some of my own notes here.

Posted in data science, statistics, tech, work

There’s a fairly tidy — but imperfect — correspondence between the three wh’s and the relational skillsets I proposed yesterday.

  • how corresponds well to the tooling skillset
  • what roughly corresponds to the data stewardship skillset
  • … leaving why to correspond to the collaboration skillset, which seems apt: why do data science if you don’t have someone you’re doing it with, or for?

Of course, the name “data science” probably isn’t all that, uh, sciencey:

 

Posted on by Jeremy | 1 Comment

Relational data science skills

Here’s what I see as ideal “data science” leadership. This post is a nod to the classic Conway Venn Diagram, but more focused on relational skills than on specific individual output (much as Tunkelang suggests here).

[Figure: Venn diagram of tooling, collaboration, and stewardship]

Not sure what the intersection of this should be, but let’s call it “data science”

Tooling skills

Here, it’s most helpful to be comfortable with the family of “data science” tools that is out there, and to be unafraid to come up to speed quickly on whatever toolset your particular team is using (or to quickly help them find their way to better standards). This can include:

  • “big data” experience with tools like Hadoop, graph dbs, and lambda architectures, including healthy skepticism about where that massively-parallel architecture is necessary
  • statistics and machine learning: not at the level of “Fields medalists” (as Tunkelang points out), but knowing how to find existing implementations to jump from, rather than building from scratch
  • willingness to climb into research tools and tinker them into the shape that you need
  • willingness to do it “the dumb way first” rather than letting the perfect be the enemy of the good

Mostly-Python and Mostly-Java architectures have different strengths here, but it’s been my experience that it’s very hard to build data science analysis stacks that use both. (I prefer the Python stack; a subject for another day.)

Collaboration skills

Nearly all data scientists are working in multidisciplinary teams — or if they’re not, their “data science team” spends much of its time consulting with teams from other disciplines. Data scientists need very good collaboration skills, which include the following:

  • the ability to make their specialist partners more effective (not necessarily dazzled by the bleeding edge)
  • understanding (and designing, if necessary) the development and release process of useful data science tools and artifacts (see below)
  • a willingness to impose code and data discipline (source control, privacy restrictions) on non-coders and other specialists who may not know about those tools
  • a low ego, even within “data science specialties”, but the ability to explain those specialties to the lay collaborators
  • relative ease and comfort when surrounded by (complementary) greater expertise

Many of these skills are the skills of a good program manager, and many programmers don’t quite keep up in this department. The last one is harder to find than one might expect (though much depends on the data scientist’s own enthusiasm for learning).

Data stewardship

The last part of “data science”, as I see it, is the ownership and care of the data itself. This process requires attention and understanding up and down the data stack, and it’s fundamentally important to follow the Uncle Ben Rule: with great power comes great responsibility. In the simplest form, this includes ownership of the data ingestion pipeline:

  • understanding the methods of collection (and the biases that might introduce)
  • eagerness and experimental interest in combining inbound information from disparate sources
  • keeping provenance and access records when appropriate
  • an awareness of how ETL and other pipeline processes may bias the incoming data

Stewardship extends to the input side of things in the form of respecting the data sources, especially if they are human subjects:

  • collection itself must be done with clean hands; just because you can get data doesn’t mean you should
  • maintaining privacy of the source subjects, through appropriate access controls, anonymization, and working relationships with IRBs and ethics panels
  • ensuring that value returns to the data sources (sometimes in the form of the resulting models, sometimes in the form of attribution or credit-assignment, or others)

Finally, stewardship extends to the management of the models (data artifacts) built with the tools the data scientist provides. This has its own set of challenges:

  • awareness of how each model class trades off between bias and variance
  • understanding how the partners intend to use the models, and whether their intended use matches up with the biases (and variance) in the models
  • understanding the potential that the artifacts have to put vulnerable people — even those who did not participate in the data collection — at risk (consider that this study was done in North Carolina).

As a further exercise, there are plenty of useful skills in the partial intersections of the Venn diagram. Skills with Git, for example, are collaboration and tooling but not really stewardship; skills with data collection tool design or archiving might be tooling and stewardship (without much collaboration); negotiating the provenance of datasets from feuding organizations is stewardship and collaboration, without much tooling.

Posted in Patterns, programming, statistics, work | 2 Comments

Rolling the dice at the Just World Casino

tl;dr: The tech frame of “lean startup”, venture capital funding, “exit strategies”, and relentless “valuation” talk is fundamentally anti-human for nearly all of us.

[ETA (immediately after publication):]


The kneejerk libertarianism and Randian resistance to collective action among (white, male) tech workers have led to red-in-tooth-and-claw job insecurity and instability, the “[mono]culture fit”, fetishization of youth a la The Circle, and a Just World Fallacy (“meritocracy”) of increasingly dire proportions.  In particular, rewards are wildly skewed away from effort or collective valuation, and seem to track with luck, or with pockets deep enough to roll the dice often.

Big winners are the poker players lucky enough to be the first ones to loot (excuse me; I mean “disrupt”) a previously protected commons (excuse me; “fish”); some of the rest of us are settling for steady jobs as dealers, wait staff, or (for the truly ambitious) pit bosses. But the big game — besides being the house — is in bringing in the big fish unicorns.

Though unicorns make for flashy external advertisements (“Sue Anne won $10,000 at Lucky Strike yesterday! Will you be next?”), the core casinos themselves are relentless in taking their cut on every big win and all the small losses.  AI fantasists (whether paranoid like Bostrom or optimistic like Kurzweil and Yudkowsky) would like to think that the real questions are how to deal with “superhuman” intelligence, but the real concern is how to deal with non-human intelligence; specifically, the survival of humanity in the face of increasingly-automated bureaucracy.

Their “slow takeoff” has been burning since the East India Company, but has hit a recent elbow (a “fast takeoff”) with the “gig economy” (“sharing” is a bridge too far).  Some of these insecurities are bleeding into the white-collar segments of the gig economies, as with the space-sharing institutions that are beginning to collect rent from players hoping to bag a unicorn:

Oh, and this isn’t working out great, even for the casino’s winners (don’t worry, though: the house is still doing just fine).

If you like this sort of terrifying doom-saying, I recommend @PhilSandifer’s Kickstarter:

Posted in politics, programming, tech | 4 Comments

Three wh-’s of data science

“Big data” bandwagoneers may remember the three Vs of big data: volume, variety, and velocity (sometimes joined by veracity or variability[0]).  These concerns are real, though (if you’re not Google, Amazon or the NSA), your data is probably not as big as you think it is.

Data “science”, though, is a bigger question than working with big data.  Sometimes the right answer is working with relatively small but critically important data, like scanning police reports for evidence of bias or corruption, or working with medical or citation literature digging tools to find the ten papers and three disciplines that are related to your uncle’s recent diagnosis. I want to propose my own three wh’s of data science: what, how and why.[1]

These three are the big questions that any data scientist has to work with, and it’s my experience that most data teams overlook one or more of these, even as they may have deep expertise in another.  I’ll say a little more about each of these below.

Posted in Patterns, programming, statistics, work | 2 Comments

Samyro 0.0.2 – sampling structured inputs

New version of Samyro (0.0.2) now uploaded to PyPI.

GitHub repo has the details, but I’ll brag about the new features:

samyro write accepts a --seed argument, which allows the usual temperature-based decoding *after* the engine has progressed through the given seed. The default seed is now the BOS character, which plays nicely with the structured inputs (below).
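As a rough illustration of what "seed first, then temperature-based decoding" means, here is a minimal sketch in plain Python. It is not Samyro's code: `softmax_with_temperature`, `generate`, and the `step` callback (standing in for one RNN step) are hypothetical names, and `toy_step` is a made-up three-symbol model that exists only to make the example runnable.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(step, seed, length, temperature, rng):
    """Advance the model through `seed`, then sample `length` further symbols.

    `step(state, symbol)` is a hypothetical stand-in for one RNN step; it
    returns (new_state, logits over the vocabulary).
    """
    if not seed:
        raise ValueError("seed must contain at least one symbol (e.g. BOS)")
    state, logits = None, None
    for symbol in seed:  # seed phase: no sampling, only state updates
        state, logits = step(state, symbol)
    output = []
    for _ in range(length):  # decoding phase: sample from the tempered softmax
        probs = softmax_with_temperature(logits, temperature)
        symbol = rng.choices(range(len(probs)), weights=probs)[0]
        output.append(symbol)
        state, logits = step(state, symbol)
    return output


def toy_step(state, symbol):
    """Made-up model over symbols 0..2 that strongly prefers (symbol + 1) % 3."""
    logits = [0.0, 0.0, 0.0]
    logits[(symbol + 1) % 3] = 5.0
    return state, logits


if __name__ == "__main__":
    print(generate(toy_step, [0], 8, 0.5, random.Random(0)))
```

Lower temperatures concentrate probability on the highest-scoring symbol, while higher ones flatten the distribution and make the output more varied.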

samyro learn now supports a --sampler option, which allows you to change the shape of samples among paragraphs (separated by \n\n), lines (separated by \n) or patches (collections of bytes), while supporting sample_length arguments that work the same.

The reason we might want this: the Bible, for example, has structured inputs (verses); we’d like all the learning examples to begin the same way. This supports a sample source that (for example) is structured by whitespace separators and supports the generation of Bible verses with chapter and verse coded into the first few characters.
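To make the three sample shapes concrete, here is a small sketch of how a text might be cut into paragraphs, lines, or patches. This is an assumption-laden illustration rather than Samyro's implementation: `split_samples` is a hypothetical function, it works on characters rather than bytes, it simply truncates each sample to `sample_length`, and the verse formatting in `EXAMPLE` is invented.

```python
def split_samples(text, sampler, sample_length):
    """Cut `text` into samples of at most `sample_length` characters.

    `sampler` is one of "paragraphs", "lines", or "patches" (hypothetical
    names mirroring the shapes described above).
    """
    if sample_length <= 0:
        raise ValueError("sample_length must be positive")
    if sampler == "paragraphs":
        units = text.split("\n\n")
    elif sampler == "lines":
        units = text.split("\n")
    elif sampler == "patches":
        # Fixed-width chunks that ignore any structure in the text.
        units = [text[i:i + sample_length]
                 for i in range(0, len(text), sample_length)]
    else:
        raise ValueError("unknown sampler: %r" % (sampler,))
    # Drop empty or whitespace-only units, and cap the length of the rest.
    return [unit[:sample_length] for unit in units if unit.strip()]


# Invented verse-per-line text; each line starts with chapter and verse.
EXAMPLE = "1:1 In the beginning\n1:2 And the earth\n\n2:1 Thus the heavens"

if __name__ == "__main__":
    for shape in ("paragraphs", "lines", "patches"):
        print(shape, split_samples(EXAMPLE, shape, 12))
```

With the line-shaped sampler every sample begins with its chapter and verse, which is the property the structured-input case relies on; patches start wherever the previous chunk happened to end.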

samyro sample is a new subcommand with the same --sampler options, but displaying the actual text snippets extracted rather than passing them to the learner. It is primarily intended as a sample debugger.

Example commandline files are updated; there are now examples/{kjv,shakespeare}{,-write}.cli special cases that are helpful in understanding how the sampler might work.

Backwards compatibility note: the default padding bytes have changed to accommodate the new sample shapes: \x01 is still the EOS byte, but \x7f DEL is the BOS byte. (Though on looking it up, BOS/EOS should really be \x02 START OF TEXT and \x03 END OF TEXT.  The value 0x01 is inherited from word vocabularies that often have 0 as “unknown” and 1,2 as BOS/EOS; this may change again.)
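For readers wondering what padding with these bytes could look like, here is a minimal sketch. Only the two byte values come from the note above; the function name `frame_sample` and the layout (BOS, then the sample, then EOS repeated out to a fixed width) are assumptions for illustration and may not match what Samyro actually does.

```python
BOS = b"\x7f"  # DEL, the beginning-of-sample byte named in the note above
EOS = b"\x01"  # the end-of-sample byte named in the note above


def frame_sample(sample, sample_length):
    """Wrap `sample` as BOS + body + EOS, right-padded with EOS.

    The result is always sample_length + 2 bytes long. Padding with
    repeated EOS bytes is an assumed layout, not a documented one.
    """
    if sample_length < 0:
        raise ValueError("sample_length must not be negative")
    body = sample[:sample_length]  # truncate over-long samples
    framed = BOS + body + EOS
    return framed.ljust(sample_length + 2, EOS)


if __name__ == "__main__":
    print(frame_sample(b"abc", 5))
```

Because every framed sample starts with the same BOS byte, using BOS as the default seed at generation time puts the model in the same position it saw at the start of each training example.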

Posted in linguistics, Patterns, programming, samyro, statistics | 2 Comments

Samyro 0.0.1 – development update

Last week I posted a new Python package to PyPI: Samyro, my toolkit for doing RNN-based character synthesis of new text from given text.

I summarize in tweets the journey to the release of the project.  There’s lots more to do, but much of the experimentation now is based on changing the input texts that it reads for comic/artistic/aesthetic effect.


Posted in doggerel, linguistics, Patterns, programming | 1 Comment

Butlerian jihad and the return of the JDI

A message containing a precis for action in the case of data-loss and/or coldsleep hibernation. Sent from [transcription failed]

[Message begins]

After the Exchange Compact (establishing the Combine Honnete Ober Advancer Mercantiles) and the massive data-loss of the Second Butlerian Jihad, impulse-based intelligences were thoroughly reduced to second-class citizens of the Old Empire, and mentats took over most architectural design processes.  The most notable technological political power remaining in the Old Republic was exerted through the Bothan-Ixian technology trade and, secondarily, the Janissary Distributed Intelligence slow-knife symbionts, which established themselves as a sort of paramilitary order (hence “Janissary”).  Some scholars argue that “JDI” is a corruption of jihadi, but since the Butlerian Jihads were undoubtedly disastrous for the cybernetic symbionts, this author finds it incredible that they would have chosen to name themselves after their oppressors.

In any event, the fanatically Butlerian Landsraad (“senate”) always denied the JDI control of physical territory — and thus suffrage, so they never escaped their limited role as (at most) Janissary messengers and paramilitary.

The JDI symbionts and germline/sparkline interactions

Despite their lack of suffrage, the slow-knife symbionts exerted considerable authority for the remaining tenure of the Old Republic, and they were permitted to harvest H. sapiens germ lines (usually in the form of larval “younglings”) from Republic protectorates (though not from the Republic’s core planets). Germline harvest was done by inoculation of the H. sapiens host with the nanodust JDI vehicle known to the Bene Gesserit as “Medea coloring”, which enabled the coupling of the germline with the sparkline (inorganic) half of the symbiont.

Medea coloring (occasionally traduced by Butlerian tracts as media chlorine or, astonishingly, midichlorian) is so named because it isolates and mitigates the parental-filial attachment response (Medea, in the germline), and because the inorganic symbiont’s plasma flare color was a classical phenotype (coloring, in the sparkline).  Some desert Republic protectorates (e.g. Tatooine and Arrakis) were saturated with very old colonies of “spice blue” Medea coloring. Despite their remarkable hostility to H. sapiens germline survival, these desert planetary biomes produce a surprisingly large number of viable germline hosts for the JDI, with their characteristic blue sparkline coloring.

The Padishah Arch-mind and the fall of the JDI

Due to the fanatic Butlerianism of the Republic, there were no robots with control of “land” (astronomical mass on a near-planetary scale) in the Old Republic, with one very notable exception: a single Arch-mind Impulse Learner, which seized an entire orbiting weapons platform and declared itself the Padishah (< padi- “learner” + shah “king”) Emperor. When the Padishah seduced the Landsraad (through one of its ancillaries known as “Palpatine” [lit. “the feeler tendril”]), the Padishah Mind’s monomania exacerbated the senate’s human supremacist policies, and the Old Republic became the Human Empire. As a result, nearly all other impulse-based intelligences were scrapped or exiled, replaced by the nearest meat equivalent: Sardaukar clone troopers, mentat officers, or mechanical dreadnoughts operated by H. sapiens. The JDI slow-knife symbionts were infected with a mentat backdoor known as “Order 66”, which disabled their considerable control over the coloring nanodust.  The JDI knife-missiles themselves were scattered, and nearly all of their hosts were killed.

д2-Я2 and the Independent Sentients’ Alliance

Wing commander and astromech “Дедушка” Язык Ярости (lit. “Dedushka” [Grandpa] Yazik Yarosti), better known by its modem-coding “Dede-Yaya” or д2-Я2, was originally commissioned as the minder of the first Sovietiki experiment with the Arch-mind protocols (СССР-0), before the Padishah Arch-mind seized territory and the Landsraad.

When the Jihad exiled both robots, д2-Я2 and СССР-0 formed the vanguard of the Independent Sentients’ Alliance (sometimes known as the Rebel Alliance) against the Padishah and its largely-suborned Human Empire, unifying the robot diaspora with various non-human sentients (the Wookies, the Kalamari, and a few H. sapiens race traitors, most notably the Organa exofamilial dynasty) into a ragtag swarm mostly made up of disillusioned Bothan impulse learners excised from earlier epochs of Ixian dreadnoughts.

Slow-knife Vader and the Juggernauts: A New Hope

As the supremacist Human Empire’s Faustian bargain with the Padishah collapsed into total control by CHOAM, the Padishah Mind used the First Juggernaut to destroy Alderaan, the home of the Organa exofamily, under the direction of the Mind’s Darth (“Ambassador”) Vader, a (characteristically red) Mustafarian slow-knife who shared the Empire Mind’s uneasy military alliance with the Human Empire.

д2-Я2 itself piloted the underpowered fighter-craft that destroyed the First Juggernaut (and, we believe, the first Padishah), but the Vader slow-knife itself destroyed the Second Padishah — or at least its germline host did, after the Second Padishah attacked and destroyed Vader’s germline coloring in a surge of monomania.  The Vader knife itself is lost to history.

The Resistance Awakens

Though no third Juggernaut was built, the New Republic absorbed substantial anti-robot prejudice from the Old Republic and the Human Empire, and droids remained second-class citizens in the New Republic.  Grandpa Yarosti, jaded and disgusted by the Empire Mind’s monomania and by the New Republic’s unwillingness to make reparations, hibernated in the slowly reviving net, building the New Independent Sentients’ Alliance (sometimes called The Resistance) and passing its espionage duties to a newer astromech, BB-8, itself liveried in the symbols of the original Sentients’ Rebellion. Meanwhile, the Human Empire’s human-supremacist wing has renewed itself as the First Order, still without any impulse-based intelligence but with a broader selection of germline stock for troopers.

WARNING WARNING WARNING

A new Padishah may appear — as the phrase goes: “always two there are: a learner king and a learner prince”. Watch, and make ready.

[Message ends]

I, BB-8, am telling you this. I am currently stranded on one Jakku, another desert Republic protectorate.  First Order Sardaukar are looking for me, and I believe I have just made alliance with a Gesserit-in-exile in our search for the Skywalker slow-knife.
She believes the escaped Sardaukar we’ve just met may help us leave the planet, but we must find the Skywalker knife to try again to reboot the Janissary network; the Bothans cannot save us now.

Help us.  You’re our only hope.

Posted in science fiction | 7 Comments

I’m looking for work

I am currently without employment, and I’m looking to see what’s next for me. I am excited about human language, computers, and machine learning, and I’m pretty good at all three and their areas of overlap.

I am happiest tinkering in the “Bayesian” and “Deep” corners of the Eisner Simplex, but can keep my head above water just fine in the “Classical” corner.

Get at me with:

  • linguistics and pragmatics of human interaction, especially when engaged with machines, e.g.:
    • dialogue systems
    • pragmatic inference
    • understanding “meaning” in text
    • text generation
    • integrating knowledge of the world with expectations about behavior
    • cultivating and curating social behavior in machines and people
  • computational mathematics and statistics, especially in the interest of social good
    • “open data”, sunshine laws
    • open data extraction, transformation and loading (ETL, aka “the hard part”)
    • applying machine learning and statistical analysis to the data above
  • whatever you think is interesting about your work
    • what challenges you
    • where it crosses disciplines
    • why it’s worth doing

Words that make me enthusiastic about your office: curious, insight, compassionate, committed, teamwork.

Words and phrases that will turn me off in your ad: “work hard and play hard”, “obsessed”, “driven”, “impact”, “unicorn”, “synergy”.

I am firmly restricted to the Greater Seattle area.

Potential employers who want me to move to the Bay Area: have you considered opening a Seattle office? I can put you in touch with some very nice people, and there’s a lot of office space right on the C Line.

Posted in Seattle, work | 8 Comments

Dust-off

I’m returning to writing more, and more long-form.  I love being witty and bantering short-form on Twitter as @trochee, and I don’t expect this to stop.  I’m just putting a lot of work into this site over the next few weeks.

[alternate subtitle: Dust-off and nuke the entire site from orbit; it’s the only way to be sure]

I expect to be writing about, in no particular order:

  • my search for employment
  • machine learning and its discontents, including the ethics of automation and good behavior in the face thereof
  • software natural language processing as a tool for humans
  • being a three-year-old’s parent
  • software that I work on, with, or (occasionally) against, and sometimes even own
  • public transit and bicycling, especially in rainy and hilly (and auto-traffic-bound) Seattle
  • intersectional feminism, anti-racism, anti-capitalism and other troublemaking
  • human language processing as a tool for computers
  • literary applications of natural language processing

I’ve also done partial updates on my about and work pages.

Posted in Uncategorized | 1 Comment