Monday, December 14, 2015

Domino dancing. A medium data research workflow with Ruffus on AWS.

I don't normally blog about technology because frankly, it is too much like work. But I have been trying Domino data labs of late, a nice product by a team mostly out of Bridgewater who clearly have good experience trying to make life easier for quants. They have built a research platform most readily used with Amazon compute but with the option of an in-house install. I certainly feel their efforts are worthy of note.

I don't really know what a data science platform should be necessarily, but perhaps it is a little like picking a dance partner: they have to know when to get out of the way in order for things to move effortlessly. That was a quality I found lacking in both in-house and competing platforms that seem to assume you are trying to step in a certain direction and in the process, inadvertently make it difficult to do simple things you feel you have already mastered. In short, they feel a bit like dancing with ... err ... me.

My workflow

I'll describe my setup for context, though I doubt there is anything atypical about my needs. I had already decided on Amazon over some in-house technology, incidentally, for several key reasons including the on-demand capability and availability of an attached file system. Another preference was use of a single beefy machine rather than many small ones because some of my steps require all the data together. With 8 GPU machines standard and a 100 processor option coming in 2016, it is hard to see the merit in administering my own box.

My econometric/algo work (a.k.a. data science these days) pushes thousands of data files through a few dozen processing steps. After running local tests I sync both data and code using the Domino client to AWS, then kick off a job there. When the job is done, I simply sync again and examine the results locally.

I use the free python library Ruffus to create a non-blocking pipeline and handle the splitting and recombining of data files therein. I can't show you my actual computation graph as it is proprietary, so here's a stock image from ruffus instead:

Imagine the same with a few dozen steps and you get the idea. Much of what I do lives on one or more of these growing DAGs.
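
For readers who have not met Ruffus, here is a minimal sketch of the split-and-recombine pattern, not the proprietary pipeline itself. The file suffixes, the two step functions and the pipeline name are hypothetical, and the wiring assumes the object-style `Pipeline` syntax available in Ruffus 2.6 and later.

```python
import csv
import glob
import os


def clean_file(input_file, output_file):
    """Per-file step: copy the CSV, dropping any row with an empty field."""
    with open(input_file, newline="") as src, open(output_file, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and all(field.strip() for field in row):
                writer.writerow(row)


def summarize(input_files, output_file):
    """Recombining step: one summary row (file name, row count) per cleaned file."""
    with open(output_file, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["file", "rows"])
        for path in sorted(input_files):
            with open(path, newline="") as src:
                row_count = sum(1 for _ in csv.reader(src))
            writer.writerow([os.path.basename(path), row_count])


def run_pipeline(data_dir, processes=1):
    """Wire the two steps into a small DAG and run it (hypothetical layout)."""
    # Imported here so the step functions above can be used without Ruffus.
    from ruffus import Pipeline, suffix

    pipeline = Pipeline("medium_data_sketch")
    # One clean_file job per raw file; independent jobs may run in parallel.
    pipeline.transform(task_func=clean_file,
                       input=glob.glob(os.path.join(data_dir, "*.raw.csv")),
                       filter=suffix(".raw.csv"),
                       output=".clean.csv")
    # A single job that waits for every cleaned file before running.
    pipeline.merge(task_func=summarize,
                   input=clean_file,
                   output=os.path.join(data_dir, "summary.csv"))
    pipeline.run(multiprocess=processes)
```

Calling `run_pipeline("some_data_dir", processes=4)` would then only redo the steps whose outputs are missing or out of date, which is much of the appeal when there are thousands of files.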

Serving results behind a REST API

The nodes on my computation tree comprise between one and a few thousand files of the same type. The leaves comprise summary tables (sometimes documentation) for this and that. Some output becomes lookup tables for REST APIs deployed on Amazon. Domino provides simple version management for the code and data serving those endpoints, making this effortless.

One merely provides the entry point: in my case a plain Python function to respond to the request (no wrapping/unwrapping). At present Domino only allows a single entry point but that is hardly a restriction. I merely register the following dispatcher function, which automatically exposes every other function I drop in the same module.
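
A minimal sketch of what such a dispatcher might look like, assuming the endpoint passes the target function's name as the first argument. The `mean` and `scale` functions are hypothetical stand-ins for whatever lives in the module.

```python
import inspect


def mean(*values):
    """Example endpoint function: arithmetic mean of the arguments."""
    return sum(values) / len(values)


def scale(value, factor=2.0):
    """Example endpoint function: multiply a value by a factor."""
    return value * factor


def dispatch(function_name, *args, **kwargs):
    """Single entry point: call any public function in this module by name."""
    # Collect every public function in the module namespace, excluding the
    # dispatcher itself so that it cannot be asked to call itself.
    candidates = {
        name: obj
        for name, obj in globals().items()
        if inspect.isfunction(obj) and not name.startswith("_") and name != "dispatch"
    }
    if function_name not in candidates:
        raise ValueError(
            "Unknown function %r; available: %s" % (function_name, sorted(candidates))
        )
    return candidates[function_name](*args, **kwargs)
```

With this in place, `dispatch("scale", 2, factor=3)` calls `scale(2, factor=3)`, and any new function added to the module becomes reachable without touching the registration.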

That's REST, pretty much for free.

Monitoring runs

No complaints here. Domino presents your runs to colleagues alongside run diagnostics such as the following.

It could be slightly easier to arrange the results, arguably, but one way around that is to maintain your own markdown index page (README.md) and point users to where you really want them to be. As a minor side point I discovered that the markdown rendering was just slightly non-standard - but I'm sure that will be knocked off the issue list by the time you read this.

Serving richer results

By default Domino tracks the new files produced by your runs and renders them to collaborators browsing the results. If you want to serve up richer content than a pile of .csv's you can have your run create an .html file and use javascript. As suggested by one of the Domino developers, here is one way of adding sortability to output tables:
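
The sketch below is one possible approach under stated assumptions and not necessarily the one suggested: a run writes a self-contained .html file whose table sorts when a column header is clicked. The function name and the inline script are illustrative, and the script assumes a single table per page.

```python
import html

# Inline JavaScript: clicking a header sorts the table body by that column,
# numerically when both cells parse as numbers, alphabetically otherwise.
SORT_SCRIPT = """
<script>
document.querySelectorAll("th").forEach(function (th, col) {
  th.addEventListener("click", function () {
    var body = th.closest("table").tBodies[0];
    var rows = Array.from(body.rows);
    rows.sort(function (a, b) {
      var x = a.cells[col].textContent, y = b.cells[col].textContent;
      var nx = parseFloat(x), ny = parseFloat(y);
      return (isNaN(nx) || isNaN(ny)) ? x.localeCompare(y) : nx - ny;
    });
    rows.forEach(function (row) { body.appendChild(row); });
  });
});
</script>
"""


def render_sortable_table(header, rows):
    """Return a complete HTML page containing one click-to-sort table."""
    head = "".join("<th>%s</th>" % html.escape(str(h)) for h in header)
    body = "".join(
        "<tr>" + "".join("<td>%s</td>" % html.escape(str(c)) for c in row) + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html>\n<html><body>\n<table>\n"
        "<thead><tr>%s</tr></thead>\n<tbody>%s</tbody>\n</table>\n%s</body></html>\n"
        % (head, body, SORT_SCRIPT)
    )
```

Writing the returned string to, say, a hypothetical `results.html` at the end of a run is enough for it to show up among the new files the run produced.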


Thanks John. The same goes for plots. Need something fancy? Just add the package to your requirements.txt and it will be installed prior to the run starting.

Reproducible research

In a sense Domino is pioneering what truly reproducible research might look like. I made a very poor stab at the same vague concept many years back so I am sympathetic as to the difficulties. (That effort lived on this site incidentally, and was so dismal I'd forgotten it myself until just now.)

There is an opportunity. While the workflow for developing code is obviously well established in source control programs and accompanying verbiage, the organizing of code, data and intermediate temporary data tends to be more bespoke. The Domino starting point, philosophically, is that code and data be kept in sync and this ensures absolute reproducibility of results. But they have not baked any religious assumptions into the product and one is free to break that symmetry. I'll get to why it helped me to do so a little later.

Certainly the major annoyances (code sharing with package management) and organizational features (basic comment tracking etc) are cleanly solved if your colleagues are also using Domino. One points them to a link to start the run that reproduces your work, or they can select from a menu on your recent runs.

Transparent research

While multi-directional collaboration is a marketing point for Domino, I find that the weaker variety of this (call it transparency) is a sufficient motivation. There is certainly no barrier here for non-technical end-users who want to re-run your research with different parameters and then compare the results. And that comparison is straightforward. One selects two finished runs in the run history and selects the "compare" checkbox to get the full diff.

Note however, that if you keep your code in the same project this comparison will include all your source changes. If you don't want that for yourself or your readers, a simple hack here is to use .dominoresults to hide your code and echo major meta-parameters to their own output file in lieu. Then the Domino presentation of your backtest results (say) is actually very helpful.
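
Echoing the meta-parameters can be as simple as the following sketch. The file name and the parameter names in the usage note are hypothetical; keys are sorted so that two runs diff cleanly.

```python
import json


def write_run_parameters(parameters, path="run_parameters.json"):
    """Write the run's meta-parameters to their own small output file."""
    with open(path, "w") as handle:
        # Sorted keys and fixed indentation keep run-to-run diffs minimal.
        json.dump(parameters, handle, indent=2, sort_keys=True)
        handle.write("\n")
    return path
```

For example, `write_run_parameters({"lookback_days": 250, "universe": "large_cap"})` at the top of a run leaves a small file that shows up in the comparison in place of the source changes.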

I'm reasonably happy with the level of transparency that the Domino platform provides. I feel it will replace a fair amount of presentation preparation (I'm a bit lazy on that front). I maintain a stable version of data and code for this purpose - or at least I aspire to - so that others can drill down into very large runs. One might hope that the days of manually converting piles of data into long powerpoint presentations are behind me, though that might be optimistic.

Perhaps an area for improvement or enhancement of the Domino platform would involve documentation. Some support for Latex generation of documents for instance.

Separating code and data again

As my data sets grew, I ultimately found it more convenient to use a separate project for code and several different projects for data. There are distinct categories of data living remotely and locally. Some data needs to be synced between local and remote instances to make it easier to run quick experiments. Some data is merely temporary. Some data should only ever live on Amazon due to its size.

Then there is the pragmatic concern with code. I personally didn't want the possibility of large globs of data corrupting my de facto code repo. Nor was there any need for the bulk of my code to be shared with those wanting to make minor hacks to the results.

These concerns are addressed. The Domino team was very helpful in introducing read-only sharing between projects. Thereafter I moved to a pattern of "code only" and "code-light data heavy" projects. In my project holding source code I allow read-only export to the other projects.

To minimize syncing of large files (and lots of files) to Amazon I use a chain of Domino data projects. My medium-data flow is as follows:

  1. Ingest data locally to ~/zipped_data_project/
  2. Sync the compressed data to Amazon with Domino client
  3. Unzip on Amazon to /mnt//unzipped_data_project/... by allowing this second project to import files from /zipped_data_project (and also from the source project containing the script in question and the rest of one's code).
  4. Allow a results project /mnt//endpoint_project to read the unzipped data and run the pipeline, creating result files and lookup tables. Use .dominoignore config to decide what is worth keeping.
  5. Publish the endpoint.

I'm leaving out source control & backup since there are many ways to do that. I use my own hacks to ensure I always backup right before syncing, because very occasionally one may need to recover from a git issue under the hood. The implicit version control achieved by syncing with the Domino client is, for me, more of a secondary line of defense.
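
The zip and unzip steps of the flow above (steps 1 and 3) can be sketched with the standard library alone. The function names are hypothetical and the directories are passed in as arguments, since the actual project paths depend on one's own setup.

```python
import os
import zipfile


def zip_directory(source_dir, archive_path):
    """Local side: compress one directory of data files into a single archive.

    The archive should live outside source_dir so it does not include itself.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to source_dir so the archive is portable.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return archive_path


def unzip_all(zipped_dir, unzipped_dir):
    """Remote side: unpack every .zip in zipped_dir that is not yet unpacked."""
    os.makedirs(unzipped_dir, exist_ok=True)
    extracted = []
    for name in sorted(os.listdir(zipped_dir)):
        if not name.endswith(".zip"):
            continue
        target = os.path.join(unzipped_dir, name[:-len(".zip")])
        if os.path.isdir(target):
            continue  # already unpacked by an earlier run
        with zipfile.ZipFile(os.path.join(zipped_dir, name)) as archive:
            archive.extractall(target)
        extracted.append(target)
    return extracted
```

Skipping archives that already have a target directory keeps repeat runs cheap, at the cost of not noticing an archive whose contents changed under the same name.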

I suspect some future improvements to Domino might obviate the minor inconvenience of the zip/unzip shuffle. And some of my issues there relate mostly to sub-par connectivity.

Overall impression

Sweet. Hopefully not sweet as in the Pepsi versus Coke trick (i.e. ask me again when I've drunk the whole can) but even if it is regarded as mere sugar for Amazon compute, Domino data labs seems to me to represent good, very reasonably priced sugar that doesn't get in the way of itself or you. It isn't trying to replace anything, just bring together a few critical ingredients like git, docker, jupyter, python/R/Julia package management and some friends from the Apache family.

The Domino team has cut the time needed for basics like REST APIs, job management, launchers for non-technical users, scheduling, sensible run management and results rendering. It is drawn together seamlessly with Scala (okay, the fact I know that is a tell, but let's say 99% seamless with only the occasional stack trace finding its way to the user :-). You don't need a separate AWS account and overall, it is very simple.

Many quants or supporting groups might no doubt build their own versions of the same but I've come to realize that pulling all this together, or even a poor man's version, can easily fall foul of Hofstadter's Law. It is a little too soon to say that Domino have commoditized this area, but they seem to be off to a good start and with a decent size engineering team in place, I expect Domino to get better. As a quant, that's good news. Liberating even.

Costs seem to be well below internal costs. And my recommendation is that even for hobby projects involving only python, R and/or Julia one might want to consider whether pricing like the following is unreasonable when it obviates a great deal of messing around.


That's my 2 cents for now (or my 9.3 cents as the case may be). For those who are interested I've written up a few more hacks in this followup post.