[tahoe-dev] making Tahoe nicer to hack on

Brian Warner warner at lothar.com
Mon Oct 18 05:32:55 UTC 2010


A little while ago, someone at Mozilla did a short presentation on
"Firefox Papercuts", based upon looking at hundreds of vague complaints
thrown out to the surprisingly-caring-but-still-not-a-support-channel
winds of "#firefox" on Twitter (in particular, things that were annoying
enough to complain about, but not always important enough to file a bug
about). These were things that caused enough friction to make people
unhappy, but weren't a big enough deal to rise up the priority list,
even though they might not be too hard to fix once somebody got serious
about them.

It got me thinking about similar problems I've had with Tahoe
development. Once upon a time, I found Tahoe incredibly easy to hack on,
especially compared to its predecessor. I was very careful to make sure
that e.g. you could run tahoe without installing it (or creating a big
complicated MySQL customer database first), could run a whole test grid
on a disconnected laptop, easily produce or acquire a .deb package for
installing to a production machine, etc.

Over the years, some of that ease has been lost. At least, personally,
I've frequently sat down to do some Tahoe hacking (for which I have a
lot less time these days), started to do something, ran into a build
error or a slowdown, got frustrated, and wandered off towards something
that felt more productive. The frustrations are usually too small to do
much about other than grumble to myself (not even worth telling the
twitterverse about them, especially since I don't want to discourage
others out there with complaints that aren't in the form of bugs with
patches and good unit tests).

I finally sat down and figured out what some of these pain-points are,
and started a branch to fix some of them. It's a fairly radical
experiment. I'm not sure that this approach is a good one yet, but I
think it's worth exploring.
http://github.com/warner/tahoe-lafs/tree/unsuck contains the current
results (N.B. it will be rebased frequently).

Here are some of the pain points I identified. This will sound like a
rant. Some of these issues may be specific to my personal workflow (a
result of switching to a computer with a slower filesystem, not using a
modern version of darcs, being picky about what touches my /usr/lib).
Some of them aren't easily fixed. But that doesn't diminish their
annoyance to me.

For many of these, we have tickets open, but that doesn't help address
my frustration, especially when the ticket can only be closed by
convincing a third-party tool author to accept a fourth-party patch.
Many of these tickets have been open for several years, suggesting that
we're nowhere near fixing them.

** pain points
   - tests take too long #20
     - re-does the build each time #657 #717 #799
       - null builds take several seconds, emit *pages* of output
         #591 #788
   - small tests take too long, are too noisy
   - "build" shouldn't even exist #479
   - frequently re-downloads source tarballs #1220
   - unsafe downloads, finding links on world-writable trac project
     pages
   - twistd/trial as subprocesses in just-built dependencies are touchy
     and complex, manipulating $PATH is dubious, differences between
     unix and windows makes it more complex
   - darcs is slow, hard to publish branches, hard to use local
     branches, hard to rewrite history before publishing, does not
     provide useful short secure versionids, is not widely available, is
     hard to build, has no good hosting services, has poor frontend/GUI
     support. Trac integration is massively inefficient, causing 500
     Internal Server Errors and timeouts when things like 'darcs
     annotate' are being run.
   - setuptools is too picky about versions of dependencies being
     available
     - tahoe should work on systems with outdated deps installed as OS
       packages, by building newer local copies. #1190
     - developers should be able to test against alternate versions of
       deps by manipulating PYTHONPATH, but eggs/easy-install.pth breaks
       this #709
   - must rebuild source tree after moving it ('develop' pseudo-symlink
     is absolute path?)
   - setuptools makes life hard for packagers: they want a 'setup.py
     install --prefix tmpdir' that never downloads anything, does not
     modify existing files in the installdir, composes well with
     other packages, and plays well with existing OS python packaging
     policies
   - way too many dependencies, 17 that I measured, some hard to justify
     (zbase32? pyutil? mock?). Benefit of using dependency must dwarf
     the overhead/pain of finding/downloading/building/installing it.
     Small deps should be inlined (mock). Non-critical deps should be
     dropped (zbase32). Larger deps should be carefully weighed (I'd
     like to drop Nevow).
     - what would it take to remove *all* non-python-stdlib
       dependencies?
     - OS packagers need separate packages for each dep, having lots of
       them makes their life difficult, which hinders adoption. It is
       much easier to depend upon popular already-packaged deps than
       upon obscure/exotic ones. Adapting our code/approach to the
       community will get tahoe packaged faster than asking them to
       adapt to us. #703

I think we've outgrown darcs. It's neat, it had features and
conveniences that nothing else had at the time (interactive
chunk-at-a-time commit was awesome), it still records more information
and can deal with merges better than probably anything else, but it's no
longer appropriate for us. It presents a barrier to entry for new users,
it makes it hard to share code, and it adds a 15-second chunk of
friction every time I try to do the tiniest little "darcs whatsnew -s",
an operation that ought to take milliseconds.

I'd like us to move to Git: it is fast, popular, well-supported, makes
it easy to share code, and makes it easy to use VC for my own personal
experiments (without creating a new tree, and rebuilding all the
dependencies, for each local branch I want to manage). I'm doing all my
local development in Git now, I can share code on github, and I only
deal with darcs when I bridge my changes to/from the master on hanford.
I worry about that bridge breaking, and I feel bad that it lags because
I don't feel safe running it from a cron job, so I look forward to
getting rid of it altogether.

"Setuptools delenda est" (awesome turn of phrase from David-Sarah:
"setuptools, it must be destroyed".. BTW the original was about
Carthage). Zooko has spent the last three years of his life bending over
backwards to deal with setuptools' bugs, misfeatures, incompatibilities,
impedance mismatches, generally broken release/development process,
maintainers, competitors, forks, toothpicks, patches, plugins, and
packaging. We currently have an unreleased snapshot of an unacknowledged
fork of an unmaintained tool that seems unwilling to do what we want it
to do.

To its credit, easy_install made my transition from a debian desktop to
an OS-X laptop a bit less painful: like CPAN's command-line tools, it is
an "awfully" easy way (both senses of the word are applicable) to get
python code onto your sys.path . I like that convenience. However I
freak out when I see it scanning the internet and installing (as root!)
the first thing that looks vaguely applicable. Scraping download links
off of project home pages (which could be world-writable wiki pages!) is
a root compromise waiting to happen.

Other than that, all of my experiences with setuptools have been
negative ones: VersionConflict or PackageNotAvailable errors when I can
import the given version just fine, multiple slow downloads and rebuilds
of packages that are already there, opaque build processes for a
non-compiled language, hundreds of lines of noise and multiple seconds
of slowdowns preceding tests when nothing has changed. And frequently
the recommended workaround is to modify something in my OS (as root),
which violates the consistency of my OS packaging system. Building
debian packages required a complete rewrite of setup.py and probably
doesn't even work these days. All of the AMD buildslaves involved in
debian packaging had to be "fixed" with various outside-of-dpkg changes
to the point that I no longer had confidence that they represented stock
lenny/karmic/etc systems and gave up maintaining them.

So, what I want from our packaging system:

   - users who start from a tarball and a reasonable OS should have an
     easy, short sequence of commands to get to ./bin/tahoe --version
   - tarball plus unreasonable OS (i.e. not having a compiler) should be
     possible, if longer
   - build from VC checkout should be no more than one extra step more
     complicated than build from tarball
   - OS packagers should have a 'make install PREFIX=' target like
     they're used to, which doesn't download anything and plays well
     with the OS-level policy of where things go
   - fast, quiet, unobtrusive, controllable, easy to override, no
     metadependencies
   - tolerate older versions that are already installed by the OS
     - never ever ask a user to uninstall something from the OS just to
       get Tahoe working
   - be able to get everything reasonable from tahoe-lafs.org in a
     single download, be able to tell which download you need

Dependencies: we have too many, and many of them are on pretty unusual
things. Most of them have sub-dependencies on even more unusual things
(pyutil, setuptools_trial). I'd like to use Graphviz's dot or one of those
graph-visualization tools to show the whole thing at once: it's kind of
frightening, especially compared to the value we're getting out of some
of them. There are advantages to moving from "write it yourself" to "use
somebody else's library", and there are advantages to moving from
"include their library in your source tree" to "ask the user to install
that library first" to "try to auto-install that library", but there are
also drawbacks. Each additional dependency puts more pressure on our use
of setuptools, makes life more difficult for OS packagers, and makes it
harder to get Tahoe running.

Some, like Twisted, are pretty big and pretty central to our
architecture (so we couldn't feasibly write-it-ourselves or include it
in our source tree). But others, like the 8kB "mock" library (used only
for unit tests), could just be inlined, or write-it-ourselves'ed.

Here's a full list of the dependencies I was able to track down, with
the primary ones (i.e. what Tahoe itself cares about) in the first
column, and all secondary deps indented. Some secondary deps are
referenced by multiple packages.

 twisted
  -zope.interface
 nevow
 foolscap
  -pyOpenSSL
 simplejson (stdlib in py2.6)
 sqlite3 (stdlib in py2.5)
 zfec
  -setuptools
  -darcsver
  -setuptools_darcs
  -argparse
  -pyutil
   -setuptools_trial
   -simplejson
   -argparse
   -zbase32
 pycryptopp (for AES, RSA, SHA256[stdlib in py2.5])
  -setuptools, darcsver, setuptools_darcs
  -setuptools_pyflakes, stdeb
 pyasn1 (for twisted.conch/SFTP)
 pycrypto (for twisted.conch/SFTP)
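
As a rough illustration of the graph idea, here is a minimal sketch that
turns the list above into Graphviz DOT text. The DEPS table is
transcribed by hand from that list, and the names "tahoe", to_dot and
all_packages are made up for this example; nothing like this exists in
the tree.

```python
# Parent -> direct sub-dependencies, transcribed from the list above.
# "tahoe" is a made-up root node standing for the primary column.
DEPS = {
    "tahoe": ["twisted", "nevow", "foolscap", "simplejson", "sqlite3",
              "zfec", "pycryptopp", "pyasn1", "pycrypto"],
    "twisted": ["zope.interface"],
    "foolscap": ["pyOpenSSL"],
    "zfec": ["setuptools", "darcsver", "setuptools_darcs", "argparse",
             "pyutil"],
    "pyutil": ["setuptools_trial", "simplejson", "argparse", "zbase32"],
    "pycryptopp": ["setuptools", "darcsver", "setuptools_darcs",
                   "setuptools_pyflakes", "stdeb"],
}


def to_dot(deps):
    """Render the dependency table as a Graphviz 'digraph' string."""
    lines = ["digraph tahoe_deps {"]
    for parent in sorted(deps):
        for child in deps[parent]:
            # Quote names so that dots (zope.interface) are legal node ids.
            lines.append('  "%s" -> "%s";' % (parent, child))
    lines.append("}")
    return "\n".join(lines)


def all_packages(deps):
    """Return every distinct package name, excluding the root node."""
    names = set(deps)
    for children in deps.values():
        names.update(children)
    names.discard("tahoe")
    return names


if __name__ == "__main__":
    print(to_dot(DEPS))
```

Saving the output to a file and running something like "dot -Tpng" on it
should draw the whole tree, shared sub-dependencies included.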

Given that Tahoe only uses 4 files from zfec (about 30 lines of python
and 1300 lines of C), it'd be nice if we didn't need to reference all 10
of its sub-dependencies. Likewise it might be nice to find less
dependency-heavy ways to get at AES and RSA (and use SHA256 from
hashlib). I like pycryptopp better than pycrypto, but if we've committed
to providing SFTP out-of-the-box, then maybe we should consider getting
our AES and RSA from pycrypto and dropping one set of dependencies. Or
maybe we should not commit to SFTP out-of-the-box, and build some sort
of plugin scheme: maybe we could reduce the size of tahoe-core and
reduce the dependency burden on folks who don't care about SFTP.


= SOLUTIONS =

My first step has been to set up a Git mirror of the tahoe tree, which
has solved many of my problems with Darcs. It also makes it possible to
easily publish a branch on which I'm experimenting with approaches to
solve some of the other problems, available here:

 http://github.com/warner/tahoe-lafs/tree/unsuck

Here's what I'm trying out in that branch:

 - rip out any mention of setuptools and pkg_resources
 - provide a "setup.py check_deps" command which, in a subprocess,
   attempts to import each known dependency, obtain its version with
   e.g. "foolscap.__version__", and compare it against a known minimum
 - provide "setup.py build_deps" which looks for tarballs of
   pre-determined names in a local directory (i.e. ../tahoe-deps/),
   unpacks them in a new subdir, and runs their "setup.py install" (with
   --single-version-externally-managed where necessary) to make them
   available in support/lib/pythonX.Y/site-packages
  - the build is performed by os.execv(sys.executable, "setup.py",
    "install"), so it always uses the same version of python as was used
    to run the top-level "setup.py build_deps", and does not search PATH
  - each build is done in a separate process, so the results of one
    build are available for import by the next
  - builds are only done when the check_deps import fails or gets an old
    version
 - replace the setuptools+entrypoints -generated bin/tahoe with a script
   like the original tahoe-0.2.0 (june-2007) bin/tahoe, which adds
   TOP/support/lib/pythonX.Y/site-packages to the import path and then
   does a simple 'from allmydata.scripts.runner import run; run()'
  - the new bin/tahoe modifies PYTHONPATH and re-execs itself. This
    allows eggs in the support/ dir to be processed, and ensures that
    subprocesses can do the same. The original one modified sys.path
    instead, and was hard to get working right for subprocesses
 - stop exec()ing standalone trial and twistd (for 'setup.py test' and
   'tahoe start', respectively). Instead, import those modules from
   Twisted and invoke their functions from within python. This removes
   the need to duplicate the shell's "which" builtin, and removes some
   of the difficulties on windows where you might be executing trial, or
   trial.py, or trial.bat, or trial.exe, or something.
  - this also removes a subprocess from the 'tahoe start' path, so that
    problems during tahoe.cfg loading or library importing are reported
    in the parent process, rather than being buried in the logs and a
    parent which emits "node probably started".
  - the new rule is: never exec() something that didn't come with Tahoe
    or with Python, our two fundamental dependencies
  - this also removes a lot of noise and slowdowns from "setup.py test"
 - relatedly, make "setup.py test" delegate to twisted.scripts.trial,
   basically by inlining some of the contents of setuptools_trial but
   removing a lot of the wrapper layers. The fact that distutils
   commands cannot just pass all of sys.argv into the child is a drag,
   as it means you must create new distutils options for every feature
   of the underlying tool you wish to expose. Also simplified things by
   allowing trial.run() to do its own sys.exit(), rather than trying to
   call into an internal function to avoid that.
  - this means that e.g. 'python setup.py test othercommand' won't work:
    it will never run othercommand. But it's much simpler and more
    robust this way.
 - move all version-measurement code out of __init__.py and into
   util/versionutil.py . I can't stand having lots of code in
   __init__.py, especially code which spawns subprocesses during
   'import'. Also I made the output of "tahoe --version" more legible
   (one package per line), and I'm planning to change it to only emit
   tahoe's version unless you add a --all flag.
 - generate src/allmydata/_version.py (which contains a string, not a
   pyutil_Version or distutils.version.LooseVersion instance) by
   checking git metadata, which can be run in O(1) time.
 - copy mock.py into src/allmydata/tests/ to remove it as a dependency.
   If/when a newer version comes out, it's easy to copy that version
   into place.
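
To make the check_deps idea concrete, here is a minimal sketch, not the
code from the branch. It imports each dependency in a child interpreter,
reads its __version__, and compares that against a minimum. The version
numbers in MINIMUM_VERSIONS are placeholders, and the helper names
(parse_version, older_than, installed_version, check_deps) and the use
of subprocess.Popen are assumptions made for this example.

```python
import subprocess
import sys

# Placeholder minimums for illustration; not Tahoe's real requirements.
MINIMUM_VERSIONS = {
    "foolscap": "0.5.1",
    "twisted": "2.4.0",
}


def parse_version(s):
    """Turn '1.10.2rc1' into (1, 10, 2); stop at a piece with no digits."""
    parts = []
    for piece in s.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def older_than(found, minimum):
    """True if version string 'found' sorts before 'minimum'."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.8' and '1.8.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def installed_version(modname):
    """Import modname in a child process; return its version or None.

    None means the import failed. An empty string means the module
    imported but has no __version__ attribute.
    """
    code = "import %s as m; print(getattr(m, '__version__', ''))" % modname
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        return None
    lines = out.decode("utf-8", "replace").strip().splitlines()
    return lines[-1] if lines else ""


def check_deps(minimums):
    """Return the sorted names of deps that are missing or too old."""
    to_build = []
    for name in sorted(minimums):
        found = installed_version(name)
        if found is None or older_than(found, minimums[name]):
            to_build.append(name)
    return to_build


if __name__ == "__main__":
    print("need to build:", check_deps(MINIMUM_VERSIONS))
```

Running the import in a child process keeps a broken dependency from
taking down setup.py itself. A module without a __version__ is treated
here as too old, which is a conservative choice for this sketch rather
than a rule from the branch.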

The next directions I'm likely to go are:

 - add a flag to 'setup.py build_deps' that will allow downloads of
   needed dependency tarballs that are not found locally. These tarballs
   will be identified by hash and fetched with urllib, so no external
   tools are necessary, and nothing is added to the reliance set
   (network-side attackers cannot inject their own code into your
   dependencies).
 - add another flag to allow dependencies to be satisfied by
   pre-compiled eggs, still identified by hash, probably hosted on
   tahoe-lafs.org . This should let the build-deps process work on
   systems without a compiler. Ideally, the eggs would be dropped into
   support/lib/pythonX.Y/site-packages/ where the PYTHONPATH-based
   lookup could find them, but I don't know if this would work without
   the easy-install.pth files. Perhaps the PYTHONPATH-setting logic
   could look in support/ for .egg files and add them directly to
   PYTHONPATH, but again I'm not sure that would be enough.
  - worst case, add a flag to offer to do "easy_install FOO" and put the
    result in support/ , giving up on safety against trojan downloads.
 - figure out some SUMO.tar -building automation, since the new scheme
   recognizes more dependencies than are currently in our SUMO tarball
   (e.g. mock, pyasn1, newer versions of many)
 - maybe build a single-file "check my system" script, and suggest that
   first-time users who are unsure of which tarballs to grab could start
   by running that script, which would check for all dependencies and
   then tell the user to either download the SUMO tarball or the smaller
   one. If we make sure that SUMO is comprehensive, it may get bigger,
   and it'd be nice to avoid a large download when it's not necessary
 - find ways to reduce dependencies
  - see about removing Nevow in favor of manually-generated HTML pages.
    We really aren't using most of Nevow's features.
  - require python2.6 and use stdlib json/sqlite instead of 3rd-party
    dependencies
  - do something about the dependency load of zfec/pycryptopp: maybe
    embed them, maybe talk zooko into improving them, maybe come up with
    replacements, not sure
 - improve test automation: build an schroot/debroot -based set of VMs
   with various debian/ubuntu releases that we care about (maybe on the
   newly-donated hardware), with well-defined package installs (e.g.
   both with and without foolscap), and add them to the buildbot. Maybe
   create them on-demand on EC2 instances.
 - improve debian packaging: start making tahoe .debs again. Make sure
   there are .debs for all our dependent packages so the tahoe.deb can
   actually be installed. Build test automation to assert this.
 - handle version numbers as something passed into each command, rather
   than being embedded in the source code. In particular, running
   "setup.py" or "./bin/tahoe" should compute a version number and make
   it available to all descendant processes. Running an installed
   /usr/bin/tahoe should use an embedded version. This is a direction I
   want to try on a number of build systems, as I think it may make more
   sense overall.
 - rip out any notion of "appname", which I think was a mistake that
   resulted from darcs version "numbers" being insufficient to capture
   branch information


Anyways, that's what I've been playing with in the last few weeks.
Please let me know what you think, and take a look at that branch. I'm
sure this will stir up some strong feelings: I'm eager to hear what
people feel about this kind of approach.

I've not tried this on windows yet, and I'm sure something will be
broken there, but I believe that it should be possible to achieve both
simplicity and works-on-windows, just as I'm sure it should be possible
to get something that is easy to use on sensible systems (e.g. debian
with all dependencies except for tahoe installed) and comfortable to use
on less-fully-featured systems too.

cheers,
 -Brian


