Data workflows with SCons

Posted on March 05, 2015 in News

tl;dr: Using SCons for data workflows has its quirks, but can be very effective. And with a little upfront work, easy parallelism.

Recently I set to work updating an analysis of Mekong and Ganges surface water inundation trends that I was working on last year. In the time since then, my colleagues at City College have developed a new version of the global inundation product, SWAMPS (Surface Water Microwave Product Series, R. Schroeder et al., publication in prep). The analysis workflow involves importing inundation, precipitation, wave model, and hydrological model output data, reprojecting everything to a common map projection, running several analysis steps, and generating output maps and plots. Looking at the mess the code had become, with code and intermediate results scattered across several directories, I decided to take the opportunity to port the code to a data workflow system.

Why workflows

As a scientist, I have a strong bias toward reproducibility and documentation of work. I don't work at a lab bench, but rather at a computer, so code and scripts tend to take the place of the traditional lab notebook. My hope is that using a real data workflow system will help increase the confidence I have that my computations and analyses are complete and correct. Depending on code, intermediate results, or one-off analyses scattered among different files, directories, or even computers is an easy way to introduce error.

I didn't find much guidance online about data processing pipelines using build systems other than make and the newcomer drake, so I figured I'd write up my experience here.

As it happens, there are a number of dedicated data/scientific workflow systems. And many, many software build systems that can, with a little fiddling, presumably handle a data pipeline. At the heart of a build system like make is a dependency graph, which is used to run commands that regenerate output files when their input files are out of date. Those files can be data and figures just as easily as source code and program binaries.

Many of the all-in-one systems have unnecessary (to me) bells-and-whistles, or are designed specifically for the types of workflows found in astronomy or bioinformatics, for instance. These seem to offer features such as visual programming, grid or cluster integration, a fancy client-server architecture... No thanks. I just want to run the code I write normally on my system.

Some argue that using a build system for data processing and analysis is overkill if you're not, say, processing new remote sensing scenes in a static manner each day. However, even for exploratory data tasks, I find that working in IPython and moving code to a script as soon as possible helps me iterate on ideas and know where I left off when I come back the next day or week. The scripted code is either in a stand-alone script or tacked on after other processing steps. With stand-alone scripts, you need to remember or document what data or code each one depends on, if you ever want to be able to use or update it again. And if it's tacked on to other code, or you use a driver script, unless you want to wait for everything to run from line 1 each time, you'll be commenting out whole sections or introducing terrible if False: type blocks. And guess what - you just developed your own less-than-half-baked build system. Of course, if you want to use a real workflow even for simple tasks, then it had better fit in seamlessly with how you work.

A quick look around and I find make, drake, SCons, waf, and about a million other systems. While make is the venerable workhorse of software build systems, and I've used it extensively in the past for compiling software and building LaTeX documents, I quite despise the almost-shell-but-not-quite makefile language. And as I'd prefer not to spend my days fighting with a build system, I'd also like good documentation, and a mature and living project.

  • make: Could get the job done, but those makefiles...

  • drake: Looks like a great project, but it is still young and doesn't seem to have much of a community around it outside of the original developers, though I may be wrong on this. But mainly, I'd rather not learn some kinda-like-make-but-not syntax. Most of the innovative features here seem to revolve around integration with HDFS, which is cool but I don't need.

  • SCons: This is a Python-based build system. The internet seems concerned with its speed, at least for projects with thousands of targets. Extensive documentation, for building software at least.

  • waf: Also a Python project, but requires(?) packaging a compressed binary of its code along with your project. I suppose that makes sense for distributing source code, but I'd rather not litter my directories with the build system itself. The documentation seems to consist of a single very large PDF file. Getting started seemed much less straightforward than with SCons.

Why SCons

So I jumped in with SCons. The most valuable feature, to me, is that the data pipeline is configured in plain Python, and it can run Python functions directly, without me needing to package every processing function up in a script, as would be the case with make. This saves a ton of time in porting existing code over. In most cases it has just been a matter of refactoring to increase code modularity and functional style, and changing how functions receive parameters. There are some quirks to using it for something other than compiling software though, which I'll describe here.

Install

There are several ways to install SCons, and all gave me varying degrees of trouble, but I eventually got pip to work inside a virtualenv with the --egg option:

$ pip install --egg scons
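
To check that the install worked, you can ask SCons for its version from inside the virtualenv:

$ scons --version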

Some workarounds

However, I was then unable to import pandas from any code run by SCons. It threw errors related to importing pickle. It turns out that, for Python 2/3 compatibility reasons (I believe), SCons fiddles with how Python imports certain modules, including pickle. And while pickle and cPickle have mostly identical APIs, they actually differ slightly when it comes to subclassing certain objects. Which breaks pandas and possibly other code.

You can work around this by defining an environment variable named SCONS_HORRIBLE_REGRESSION_TEST_HACK to be non-empty. However, simply setting that variable causes another problem (see comments in $SCONSDIR/compat/__init__.py, lines 170 and 215 in version 2.3.0) in a section of code that doesn't want it set under normal circumstances. So we set the variable to a specific sentinel value, say 1234, and hack SCons to check for that value specifically.

In your ~/.bashrc, add:

export SCONS_HORRIBLE_REGRESSION_TEST_HACK=1234

Change $SCONSDIR/compat/__init__.py to have:

# line 215
if os.environ.get('SCONS_HORRIBLE_REGRESSION_TEST_HACK') is not None and \
    os.environ.get('SCONS_HORRIBLE_REGRESSION_TEST_HACK') != '1234':

Now SCons plays nicely with the rest of our code, and I think the only trade-off is that SCons itself will now use the slightly slower pickle module instead of cPickle.

Setting up a project

SConstruct

At the top directory is your SConstruct file. A small project can put all build directives in this file, or it can call out to several SConscript files which contain the build instructions. This file is a normal Python file, but with all the SCons functions and objects magically imported and ready to use:

import os
env = Environment(ENV=os.environ)
env.Decider('MD5-timestamp')

The Environment() call sets up a build environment that you can use to set various build options. Those options are oriented toward software compilation, so we just pass along our regular environment variables via ENV.

By default, SCons uses MD5 hashes of file contents to determine whether an input file has changed. When working with even moderately sized data files, computing these hashes on every build can take a very long time. The MD5-timestamp option instructs SCons to first check file timestamps, and only compute the MD5 hash if the timestamp has changed, to confirm that file is actually different. This can speed things up dramatically.

Export('env')
option1 = True
Export('option1')

This makes env and option1 available to code in all your SConscript files. Note that the objects to export are passed as strings.

option2 = True
SConscript('subdir1/SConscript')
SConscript('subdir2/SConscript',
        exports=['option2'])

Here we call into two SConscript files, which are processed the same way as SConstruct, but let you structure your code more cleanly. subdir1/SConscript has access to env and option1, while subdir2/SConscript has access to env, option1, and option2.
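
Inside an SConscript, Import() pulls those exported names back into scope. As a minimal sketch, subdir2/SConscript could grab just the names it needs instead of using Import('*'):

# subdir2/SConscript
Import('env', 'option2')

if option2:
    print('option2 is enabled')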

SConscript

For larger projects, I use several SConscript files organized logically into subdirectories, each importing their own Python module with the actual processing code. For small projects, just put all this code directly into the SConstruct file.
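
For a project with a couple of processing stages, the layout might look something like this (the module names are just illustrative):

SConstruct
build/                    # generated targets collect here
subdir1/
    SConscript
    some_python_code.py   # processing functions used by subdir1/SConscript
subdir2/
    SConscript
    more_processing.py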

In subdir1/SConscript:

Import('*')   #get access to the objects exported from SConstruct

step1 = env.Command(
        target='#build/ourdata.tar.gz',
        source=None,
        action='wget -O $TARGET http://someserver.com/SOME_LONG_NAME.tar.gz',
        )

step2 = env.Command(
        target='#build/ourdata.csv',
        source='#build/ourdata.tar.gz',
        action='tar -xzf $SOURCE -C ${TARGET.dir}',
        )

SCons offers many build tools that magically know how to turn, say, code.c into code.o or a program binary code. For arbitrary data processing pipelines, we can use the generic env.Command() function, which takes a target file or list of files and a source file or list of files, and runs an action command or list of action commands. An action runs whenever SCons needs the target and either it doesn't exist or its sources have changed. In the source and target paths, the "#" identifies a path as relative to the top-level SConstruct file, which is useful if you want all your build outputs in one place, regardless of where the SConscript file is. SCons automatically creates any needed directories, like build. $SOURCE and $TARGET are expanded before the action runs.
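
As a quick sketch of those list forms, a single Command() can produce several targets from one source by running a list of shell actions (the file names and commands here are made up):

step3 = env.Command(
        target=['#build/summary.csv', '#build/summary.log'],
        source='#build/ourdata.csv',
        action=['sort $SOURCE > ${TARGETS[0]}',
                'wc -l $SOURCE > ${TARGETS[1]}'],
        )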

From the command line in the directory with SConstruct:

$ scons

will try to build all targets under the current directory. If you specify a target:

$ scons build/ourdata.tar.gz

will run wget, if necessary, and then stop before building build/ourdata.csv.
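
Two other flags worth knowing: -n shows which actions would run without actually executing them, and -c removes built targets so they get regenerated from scratch:

$ scons -n
$ scons -c build/ourdata.csv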

Python function actions

The single best aspect of SCons for my needs is the ability to use regular Python functions as actions. In the examples above, SCons uses a shell to run wget and tar. Using a Python function directly also allows us to pass arbitrary Python data without serializing it onto a command line.

We can use this contrived function from some_python_code.py to take the first few lines, specified as an additional parameter, capitalize them, and save to a target output file:

def make_head_upper(env, source, target):
    n = env['num_lines']
    with open(str(source[0])) as infile:
        with open(str(target[0]), 'w') as outfile:
            for i in range(n):
                outfile.write(infile.readline().upper())
    return 0

In our SConscript file, we use this as:

import some_python_code as mycode
env.Command(
        source='#build/ourdata.csv',
        target='#build/topdata.csv',
        action=mycode.make_head_upper,
        num_lines=5,
        )

The function parameters must be named env, source, and target, as SCons passes these arguments in by keyword. Additional keyword arguments to env.Command are passed through the env dict-like object to the function. source and target are lists of Node objects, which is annoying, but we can get pathnames from them using str() easily enough. Passing several sources makes them available as source[0], source[1], ...
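
For example, here is a hypothetical action that concatenates several sources into one target (the function and file names are made up for illustration):

def concat_files(env, source, target):
    # source and target are lists of Nodes; str() gives the file path
    with open(str(target[0]), 'w') as outfile:
        for src in source:
            with open(str(src)) as infile:
                outfile.write(infile.read())
    return 0

env.Command(
        source=['#build/part1.csv', '#build/part2.csv'],
        target='#build/combined.csv',
        action=concat_files,
        )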

Parallelized builds

Since SCons knows the dependency graph, it can run actions that are independent of each other in parallel:

$ scons -j 4

to tell SCons to use 4 threads. If your actions call out to external programs using the shell, this works very well. But if you're using Python functions, all these threads will be in the same Python process, and due to the GIL you won't necessarily see much of a speedup if you're doing lots of computation. I've also run into problems with SCons's thread-based model when trying to call out to GRASS GIS code. GRASS uses the filesystem as a map database, and doesn't like multiple threads or processes working on the same mapset concurrently. My normal solution is to tell each GRASS process to use a different mapset, and then move all the results to one place at the end. But GRASS stores the mapset name as an environment variable, which is shared across all threads. That's a problem. Since SCons uses threads rather than new processes, I can't parallelize any GRASS calls from my Python functions using the multiple-mapset trick.

One solution is to have your Python function use the multiprocessing module. But that means boilerplate added to every one of your functions, and I'd still like to be able to run in single-process mode for dramatically easier debugging with ipdb, without having to refactor the function each time.

My solution is to wrap the function using a decorator that both holds all the boilerplate and handles passing arguments in and return values out using a Queue:

from functools import wraps
from multiprocessing import Process, Queue

def in_new_process(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        def worker(queue, *args, **kwargs):
            # run the wrapped function in the child process and send
            # its return value back through the queue
            queue.put(func(*args, **kwargs))
        queue = Queue()
        p = Process(
                target=worker,
                args=(queue,) + args,
                kwargs=kwargs)
        p.start()
        # read the result before join(), so a large return value
        # can't leave the child blocked on a full pipe
        result = queue.get()
        p.join()
        return result
    return wrapper

Now, each SCons thread will launch a single Python subprocess to do the actual work. This works around the GIL, and for GRASS commands I can set the mapset in each process's environment separately. It works very well, at the cost of launching a new process; as long as each function does some real work, that cost is negligible.

Use it as:

@in_new_process
def make_head_upper(env, source, target):
    ...

To run single-threaded again, simply comment out the decorator and drop the -j option.

This is especially useful if you have task parallelism. I often find myself running code to analyze multiple deltas individually, so I set up my SConscript as:

for delta in ['Mekong', 'Ganges', 'Amazon', ...]:
    env.Command(
            source='{}_input.pkl'.format(delta),
            target='{}_processed.pkl'.format(delta),
            action=process_one_delta,
            delta=delta,
            )

With the @in_new_process decorator and running scons -j 4, this will process several deltas in parallel.

Wrapping up

Overall I've been very happy with the two mini-projects I've organized around SCons. I'm pleased to note that the second of those went much faster, and introduced very little additional mental burden compared with just blindly jumping into vim and IPython. There are a number of quirks stemming from the fact that SCons is designed foremost for building software, not data processing pipelines. Having worked around these, I've been writing much more modular code. I can change any part of the pipeline with confidence that I won't forget to update a later step to use the new results. Running tasks in parallel is mostly abstracted away. Computationally intensive steps, or slow downloads, can be included directly in the pipeline without slowing down my ability to iterate through the analysis process. Every step from data download to figure generation is documented in one place, and checked for completeness each time I run SCons.

Data processing documentation and reproducibility are critical as more scientific fields become data-rich. Scripts are good for knowing what you did to accomplish a particular task, but then you must remember which scripts to run, when, with what arguments, and in what order. Hopefully, this introduction has shown you how I'm addressing this problem in my own work.