Data workflows with SCons
Posted on March 05, 2015 in News
tl;dr: Using SCons for data workflows has its quirks, but can be very effective. And with a little upfront work, easy parallelism.
Recently I set to work updating an analysis of Mekong and Ganges surface water inundation trends that I had been working on last year. In the time since then, my colleagues at City College have developed a new version of the global inundation product, SWAMPS (Surface Water Microwave Product Series, R. Schroeder et al., publication in prep). The analysis workflow involves importing inundation, precipitation, wave model, and hydrological model data, reprojecting everything to a common map projection, running several analysis steps, and generating output maps and plots. Looking at the mess the project had become, with code and intermediate results scattered across several directories, I decided to take the opportunity to port the code to a data workflow system.
Why workflows
As a scientist, I have a strong bias toward reproducibility and documentation of work. I don't work at a lab bench, but rather at a computer, so code and scripts tend to take the place of the traditional lab notebook. My hope is that using a real data workflow system will help increase the confidence I have that my computations and analyses are complete and correct. Depending on code, intermediate results, or one-off analyses scattered among different files, directories, or even computers is an easy way to introduce error.
I didn't find much guidance online about data processing pipelines using build systems other than make and the newcomer drake, so I figured I'd write up my experience here.
As it happens, there are a number of dedicated data/scientific workflow systems, and many, many software build systems that can, with a little fiddling, presumably handle a data pipeline. At the heart of a build system like make is a dependency graph, which determines what commands to run to regenerate output files when their inputs are out of date. Those files can be data and figures just as easily as source code and program binaries.
Many of the all-in-one systems have unnecessary (to me) bells and whistles, or are designed specifically for the types of workflows found in, for instance, astronomy or bioinformatics. These seem to offer features such as visual programming, grid or cluster integration, a fancy client-server architecture... No thanks. I just want to run the code I write normally on my system.
Some argue that using a build system for data processing and analysis is
overkill if you're not, say, processing new remote sensing scenes in a static
manner each day. However, even for exploratory data tasks, I find that working
in IPython and moving code to a script as soon as possible helps me iterate on
ideas, and know where I left off when I come back the next day or week. The
scripted code is either in a stand-alone script or tacked on after other
processing steps. With stand-alone scripts, you need to remember or document
what data or code this depends on, if you ever want to be able to use or update
it again. And if it's tacked on to other code, or you use a driver script, then unless you want to wait for everything to run from line 1 each time, you'll be commenting out whole sections or introducing terrible if False: type blocks.
And guess what - you just developed your own less-than-half-baked build system.
Of course, if you want to use a real workflow even for simple tasks, then it had
better fit in seamlessly with how you work.
A quick look around and I find make, drake, SCons, waf, and about a million other systems. While make is the venerable workhorse of software build systems, and I've used it extensively in the past for compiling software and building LaTeX documents, I quite despise the almost-shell-but-not-quite makefile language. And as I'd prefer not to spend my days fighting with a build system, I'd also like good documentation, and a mature and living project.
- make: Could get the job done, but those makefiles...
- drake: Looks like a great project, but it is still young and doesn't seem to have much of a community around it outside of the original developers, though I may be wrong on this. But mainly, I'd rather not learn some kinda-like-make-but-not syntax. Most of the innovative features here seem to revolve around integration with HDFS, which is cool but which I don't need.
- SCons: This is a Python-based build system. The internet seems concerned with its speed, at least for projects with thousands of targets. Extensive documentation, for building software at least.
- waf: Also a Python project, but it requires(?) packaging a compressed binary of its code along with your project. I suppose that makes sense for distributing source code, but I'd rather not litter my directories with the build system itself. The documentation seems to consist of a single very large PDF file. Getting started seemed much less straightforward than with SCons.
Why SCons
So I jumped in with SCons. The most valuable feature, to me, is that the data pipeline is configured in plain Python, and it can run Python functions directly, without me needing to package every processing function up in a script, as would be the case with make. This saves a ton of time in porting existing code over. In most cases it has just been a matter of refactoring to increase code modularity and functional style, and changing how functions receive parameters. There are some quirks to using it for something other than compiling software though, which I'll describe here.
Install
There are several ways to install SCons, and all gave me varying degrees of trouble, but I eventually got pip to work inside a virtualenv with the --egg option:
$ pip install --egg scons
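A quick sanity check that the install worked is to ask SCons for its version:
$ scons --version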
Some workarounds
However, I was then unable to import pandas from any code run by SCons. It threw errors related to importing pickle. It turns out that, for Python 2/3 compatibility reasons (I believe), SCons fiddles with how Python imports certain modules, including pickle. And while pickle and cPickle have mostly identical APIs, they actually differ slightly when it comes to subclassing certain objects, which breaks pandas and possibly other code.
You can work around this by defining an environment variable named SCONS_HORRIBLE_REGRESSION_TEST_HACK to be non-empty. However, simply setting that variable causes another problem (see the comments in $SCONSDIR/compat/__init__.py, lines 170 and 215 in version 2.3.0) in a section of code that doesn't want it set under normal circumstances. So we set the variable to a specific sentinel value, say 1234, and hack SCons to check for that value specifically.
In your ~/.bashrc, add:
export SCONS_HORRIBLE_REGRESSION_TEST_HACK=1234
Change $SCONSDIR/compat/__init__.py to have:
# line 215
if os.environ.get('SCONS_HORRIBLE_REGRESSION_TEST_HACK') is not None and \
os.environ.get('SCONS_HORRIBLE_REGRESSION_TEST_HACK') != '1234':
Now SCons plays nicely with the rest of our code, and I think the only trade-off is that SCons will now use the slightly slower pickle module instead of cPickle.
Setting up a project
SConstruct
At the top directory is your SConstruct file. A small project can put all build directives in this file, or it can call out to several SConscript files which contain the build instructions. This file is a normal Python file, but with all the SCons functions and objects magically imported and ready to use:
import os
env = Environment(ENV=os.environ)
env.Decider('MD5-timestamp')
The Environment() call sets up a build environment that you can use to set various build options. This is oriented toward software compilation, so we just pass along our regular environment.
By default, SCons uses MD5 hashes of file contents to determine whether an input
file has changed. When working with even moderately sized data files, computing
these hashes on every build can take a very long time. The MD5-timestamp
option instructs SCons to first check file timestamps, and only compute the MD5
hash if the timestamp has changed, to confirm that the file is actually different.
This can speed things up dramatically.
Export('env')
option1 = True
Export('option1')
This makes env and option1 available to code in all your SConscript files. Note that the objects to export are passed as strings.
SConscript('subdir1/SConscript')
SConscript('subdir2/SConscript', exports=['option2'])
Here we call into two SConscript files, which are processed the same way as SConstruct, but allow you to structure your code more cleanly. Here, subdir1/SConscript has access to env and option1, while subdir2/SConscript has access to env, option1, and option2.
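For reference, the receiving side can also import names individually rather than grabbing everything; here is a minimal sketch of what subdir2/SConscript might contain (option2 is just the illustrative variable from above):

Import('env', 'option2')  # import specific exported names instead of using Import('*')

if option2:
    # build directives that depend on option2 go here
    pass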
SConscript
For larger projects, I use several SConscript files organized logically into subdirectories, each importing its own Python module with the actual processing code. For small projects, just put all this code directly into the SConstruct file.
In subdir1/SConscript:
Import('*')  # get access to the objects exported from SConstruct

step1 = env.Command(
    target='#build/ourdata.tar.gz',
    source=None,
    action='wget -O $TARGET http://someserver.com/SOME_LONG_NAME.tar.gz',
)

step2 = env.Command(
    target='#build/ourdata.csv',
    source='#build/ourdata.tar.gz',
    action='tar -xzf $SOURCE',
)
SCons offers many build tools that magically know how to turn, say, code.c into code.o or a program binary code. For arbitrary data processing pipelines, we can use the generic env.Command() function, which takes a target file or list of files and a source file or list of source files, and runs an action command or list of action commands. An action will run whenever SCons needs the target and either it doesn't exist or its sources have changed. In the source and target paths, the "#" identifies a path as relative to the top-level SConstruct file, which is useful if you want all your build outputs in one place, regardless of where the SConscript file is. SCons automatically creates any needed directories, like build. $SOURCE and $TARGET are expanded before the action runs.
From the command line, in the directory with SConstruct:
$ scons
will try to build all targets under the current directory. If you specify a target:
$ scons build/ourdata.tar.gz
will run wget, if necessary, and then stop before building build/ourdata.csv.
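Two other standard SCons flags are handy when iterating on a pipeline: -n does a dry run, printing the actions that would run without executing them, and -c removes the targets that SCons knows how to build:
$ scons -n
$ scons -c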
Python function actions
The single best aspect of SCons for my needs is the ability to use regular Python functions as actions. In the examples above, SCons uses a shell to run wget and tar. Using a function directly also allows us to pass arbitrary Python data without serializing it on a command line.
We can use this contrived function from some_python_code.py to take the first few lines of a file (the number specified as an additional parameter), capitalize them, and save them to a target output file:
def make_head_upper(env, source, target):
    n = env['num_lines']
    with open(str(source[0])) as infile:
        with open(str(target[0]), 'w') as outfile:  # open the target for writing
            for i in range(n):
                outfile.write(infile.readline().upper())
    return 0
In our SConscript file, we use this as:
import some_python_code as mycode

env.Command(
    source='#build/ourdata.csv',
    target='#build/topdata.csv',
    action=mycode.make_head_upper,
    num_lines=5,
)
The function parameters must be named env, source, and target, as SCons passes these arguments in by keyword. Additional keyword arguments to env.Command are passed through the env dict-like object to the function. source and target are lists of Node objects, which is annoying, but we can get pathnames from them using str() easily enough. Passing several sources makes them available as source[0], source[1], ...
Parallelized builds
Since SCons knows the dependency graph, it can run actions that are independent of each other in parallel:
$ scons -j 4
to tell SCons to use 4 threads. If your actions call out to external programs using the shell, this works very well. But if you're using Python functions, all these threads will be in the same Python process, and due to the GIL you won't necessarily see that great of a speedup if you're doing lots of computation. I've also run into problems with SCons's thread-based model when trying to call out to GRASS GIS code. GRASS uses the filesystem as a map database, and doesn't like multiple threads or processes working on the same mapset concurrently. My normal solution is to tell each GRASS process to use a different mapset, and then move all the results to one place at the end. But GRASS stores the mapset name in an environment variable, which is shared across all threads. That's a problem. Since SCons uses threads rather than new processes, I can't parallelize any GRASS calls from my Python functions using the multiple-mapset trick.
One solution is to have your Python function use the multiprocessing module. That means adding boilerplate to every one of your functions, though, and I'd like to still be able to run in single-process mode for the dramatically easier debugging with ipdb, without having to refactor the function each time. My solution is to wrap the function in a decorator which both holds all the boilerplate and handles passing arguments in and return values out using a Queue:
from functools import wraps
from multiprocessing import Process, Queue

def in_new_process(func):
    @wraps(func)  # wraps must be called with the wrapped function
    def wrapper(*args, **kwargs):
        def worker(*args, **kwargs):
            queue = args[0]
            func = args[1]
            queue.put(func(*args[2:], **kwargs))
        queue = Queue()
        p = Process(
            target=worker,
            args=((queue, func) + args),
            kwargs=kwargs)
        p.start()
        result = queue.get()  # read the result before join() so a full Queue can't deadlock
        p.join()
        return result
    return wrapper
Now, each SCons thread will launch a single Python subprocess to do the actual work. This works around the GIL, and for GRASS commands I can set the mapset in each process's environment separately. This works very well, at the cost of launching a new process. As long as each function does some real work, this cost is negligible.
Use it as:
@in_new_process
def make_head_upper(env, source, target):
...
Simply comment out the decorator and don't use the -j option to run single-threaded again.
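If commenting out the decorator gets tedious, one possible variant (just a sketch; SCONS_SERIAL is an arbitrary, made-up variable name, not anything SCons itself recognizes) is to have the decorator return the function untouched whenever that variable is set, so everything runs in-process for easy debugging:

import os
from functools import wraps
from multiprocessing import Process, Queue

def in_new_process(func):
    # Opt-out switch: when SCONS_SERIAL is set, skip the subprocess machinery
    # entirely so the function runs in-process (easy ipdb debugging).
    if os.environ.get('SCONS_SERIAL'):
        return func

    @wraps(func)
    def wrapper(*args, **kwargs):
        def worker(queue, func, *args, **kwargs):
            queue.put(func(*args, **kwargs))

        queue = Queue()
        p = Process(target=worker, args=(queue, func) + args, kwargs=kwargs)
        p.start()
        result = queue.get()  # read before join() so a full Queue can't deadlock
        p.join()
        return result

    return wrapper

Then running SCONS_SERIAL=1 scons executes the whole pipeline serially in one process.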
The decorator approach is especially useful if you have task parallelism. I often find myself running code to analyse multiple deltas individually, so I set up my SConscript as:
for delta in ['Mekong', 'Ganges', 'Amazon', ...]:
    env.Command(
        source='{}_input.pkl'.format(delta),
        target='{}_processed.pkl'.format(delta),
        action=process_one_delta,
        delta=delta,
    )
With the @in_new_process decorator and running scons -j 4, this will process several deltas in parallel.
Wrapping up
Overall I've been very happy with the two mini-projects I've organized around SCons. I'm pleased to note that the second of those went much faster, and introduced very little additional mental burden compared with just blindly jumping into vim and IPython. There are a number of quirks stemming from the fact that SCons is designed foremost for building software, not data processing pipelines. Having worked around these, I've been writing much more modular code. I can change any part of the pipeline with confidence that I won't forget to update a later step to use the new results. Running tasks in parallel is mostly abstracted away. Computationally intensive steps, or slow downloads, can be included directly in the pipeline without slowing down my ability to iterate through the analysis process. Every step from data download to figure generation is documented in one place, and checked for completeness each time I run SCons.
Documentation and reproducibility of data processing are critical as more scientific fields become data-rich. Scripts are good for recording what you did to accomplish a particular task, but then you must remember which scripts to run, when, with what arguments, and in what order. Hopefully this introduction has shown you how I'm addressing this problem in my own work.