[tahoe-dev] making Tahoe nicer to hack on

Ravi Pinjala ravi at p-static.net
Mon Oct 18 07:00:06 UTC 2010


This sounds like a really worthwhile project! Especially the stuff
about cleaning up the dependencies; I tried packaging Tahoe for Gentoo
at one point, and gave up completely after a few hours, partly because
of the number of dependencies unique to Tahoe (which I had to
recursively package as well), and partly because Tahoe would default
to automagically downloading and installing its own dependencies
(which is absolutely horrifying, honestly).

I'd even argue that what you're proposing for downloading the
dependencies is too much. The standard for *nix systems (in other
words, what users expect) is for the package to fail fast at build
time if anything's missing, and to let the package manager sort out
the deps ahead of time. Tahoe's well-known enough now, I think, that
we can aim to get it into distro repositories, rather than assuming
that users will be left to handle deps themselves unless we do it for
them at build time.

--Ravi

On Sun, Oct 17, 2010 at 10:32 PM, Brian Warner <warner at lothar.com> wrote:
>
> A little while ago, someone at Mozilla did a short presentation on
> "Firefox Papercuts", based upon looking at hundreds of vague complaints
> thrown out to the surprisingly-caring-but-still-not-a-support-channel
> winds of "#firefox" on Twitter (in particular, things that were annoying
> enough to complain about, but not always important enough to file a bug
> about). Things that were causing enough friction to make people unhappy,
> but weren't really a big enough deal to rise up the priority list, but
> which might not be too hard to fix once somebody got serious about it.
>
> It got me thinking about similar problems I've had with Tahoe
> development. Once upon a time, I found Tahoe incredibly easy to hack on,
> especially compared to its predecessor. I was very careful to make sure
> that e.g. you could run tahoe without installing it (or creating a big
> complicated MySQL customer database first), could run a whole test grid
> on a disconnected laptop, easily produce or acquire a .deb package for
> installing to a production machine, etc.
>
> Over the years, some of that ease has been lost. At least, personally,
> I've frequently sat down to do some Tahoe hacking (for which I have a
> lot less time these days), started to do something, ran into a build
> error or a slowdown, got frustrated, and wandered off towards something
> that felt more productive. The frustrations are usually too small to do
> much about other than grumble to myself (not even worth telling the
> twitterverse about them, especially since I don't want to discourage
> others out there with complaints that aren't in the form of bugs with
> patches and good unit tests).
>
> I finally sat down and figured out what some of these pain-points are,
> and started a branch to fix some of them. It's a fairly radical
> experiment. I'm not sure that this approach is a good one yet, but I
> think it's worth exploring.
> http://github.com/warner/tahoe-lafs/tree/unsuck contains the current
> results (N.B. it will be rebased frequently).
>
> Here are some of the pain points I identified. This will sound like a
> rant. Some of these issues may be specific to my personal workflow (a
> result of switching to a computer with a slower filesystem, not using a
> modern version of darcs, being picky about what touches my /usr/lib).
> Some of them aren't easily fixed. But that doesn't diminish their
> annoyance to me.
>
> For many of these, we have tickets open, but that doesn't help address
> my frustration, especially when the ticket can only be closed by
> convincing a third-party tool author to accept a fourth-party patch.
> Many of these tickets have been open for several years, suggesting that
> we're nowhere near fixing them.
>
> ** pain points
>   - tests take too long #20
>     - re-does the build each time #657 #717 #799
>       - null builds take several seconds, emit *pages* of output
>         #591 #788
>   - small tests take too long, are too noisy
>   - "build" shouldn't even exist #479
>   - frequently re-downloads source tarballs #1220
>   - unsafe downloads, finding links on world-writable trac project
>     pages
>   - twistd/trial as subprocesses in just-built dependencies are touchy
>     and complex, manipulating $PATH is dubious, differences between
>     unix and windows make it more complex
>   - darcs is slow, hard to publish branches, hard to use local
>     branches, hard to rewrite history before publishing, does not
>     provide useful short secure versionids, is not widely available, is
>     hard to build, has no good hosting services, has poor frontend/GUI
>     support. Trac integration is massively inefficient, causing 500
>     Internal Server Errors and timeouts when things like 'darcs
>     annotate' are being run.
>   - setuptools is too picky about versions of dependencies being
>     available
>     - tahoe should work on systems with outdated deps installed as OS
>       packages, by building newer local copies. #1190
>     - developers should be able to test against alternate versions of
>       deps by manipulating PYTHONPATH, but eggs/easy-install.pth breaks
>       this #709
>   - must rebuild source tree after moving it ('develop' pseudo-symlink
>     is absolute path?)
>   - setuptools makes life hard for packagers: they want a 'setup.py
>     install --prefix tmpdir' that never downloads anything, does not
>     modify existing files in the installdir, composes well with other
>     packages, and plays well with existing OS python packaging
>     policies
>   - way too many dependencies, 17 that I measured, some hard to justify
>     (zbase32? pyutil? mock?). Benefit of using dependency must dwarf
>     the overhead/pain of finding/downloading/building/installing it.
>     Small deps should be inlined (mock). Non-critical deps should be
>     dropped (zbase32). Larger deps should be carefully weighed (I'd
>     like to drop Nevow).
>     - what would it take to remove *all* non-python-stdlib
>       dependencies?
>     - OS packagers need separate packages for each dep, having lots of
>       them makes their life difficult, which hinders adoption. It is
>       much easier to depend upon popular already-packaged deps than
>       upon obscure/exotic ones. Adapting our code/approach to the
>       community will get tahoe packaged faster than asking them to
>       adapt to us. #703
>
> I think we've outgrown darcs. It's neat, it had features and
> conveniences that nothing else had at the time (interactive
> chunk-at-a-time commit was awesome), it still records more information
> and can deal with merges better than probably anything else, but it's no
> longer appropriate for us. It presents a barrier to entry for new users,
> it makes it hard to share code, and it adds a 15-second chunk of
> friction every time I try to do the tiniest little "darcs whatsnew -s",
> an operation that ought to take milliseconds.
>
> I'd like us to move to Git: it is fast, popular, well-supported, makes
> it easy to share code, and makes it easy to use VC for my own personal
> experiments (without creating a new tree, and rebuilding all the
> dependencies, for each local branch I want to manage). I'm doing all my
> local development in Git now, I can share code on github, and I only
> deal with darcs when I bridge my changes to/from the master on hanford.
> I worry about that bridge breaking, and I feel bad that it lags because
> I don't feel safe running it from a cron job, so I look forward to
> getting rid of it altogether.
>
> "Setuptools delenda est" (awesome turn of phrase from David-Sarah:
> "setuptools, it must be destroyed".. BTW the original was about
> Carthage). Zooko has spent the last three years of his life bending over
> backwards to deal with setuptools' bugs, misfeatures, incompatibilities,
> impedance mismatches, generally broken release/development process,
> maintainers, competitors, forks, toothpicks, patches, plugins, and
> packaging. We currently have an unreleased snapshot of an unacknowledged
> fork of an unmaintained tool that seems unwilling to do what we want it
> to do.
>
> To its credit, easy_install made my transition from a debian desktop to
> an OS-X laptop a bit less painful: like CPAN's command-line tools, it is
> an "awfully" easy way (both senses of the word are applicable) to get
> python code onto your sys.path . I like that convenience. However I
> freak out when I see it scanning the internet and installing (as root!)
> the first thing that looks vaguely applicable. Scraping download links
> off of project home pages (which could be world-writable wiki pages!) is
> a root compromise waiting to happen.
>
> Other than that, all of my experiences with setuptools have been
> negative ones: VersionConflict or PackageNotAvailable errors when I can
> import the given version just fine, multiple slow downloads and rebuilds
> of packages that are already there, opaque build processes for a
> non-compiled language, hundreds of lines of noise and multiple seconds
> of slowdowns preceding tests when nothing has changed. And frequently
> the recommended workaround is to modify something in my OS (as root),
> which violates the consistency of my OS packaging system. Building
> debian packages required a complete rewrite of setup.py and probably
> doesn't even work these days. All of the AMD buildslaves involved in
> debian packaging had to be "fixed" with various outside-of-dpkg changes
> to the point that I no longer had confidence that they represented stock
> lenny/karmic/etc systems and gave up maintaining them.
>
> So, what I want from our packaging system:
>
>   - users who start from a tarball and a reasonable OS should have an
>     easy, short sequence of commands to get to ./bin/tahoe --version
>   - tarball plus unreasonable OS (i.e. not having a compiler) should be
>     possible, if longer
>   - build from VC checkout should be no more than one step more
>     complicated than build from tarball
>   - OS packagers should have a 'make install PREFIX=' target like
>     they're used to, which doesn't download anything and plays well
>     with the OS-level policy of where things go
>   - fast, quiet, unobtrusive, controllable, easy to override, no
>     metadependencies
>   - tolerate older versions that are already installed by the OS
>     - never ever ask a user to uninstall something from the OS just to
>       get Tahoe working
>   - be able to get everything reasonable from tahoe-lafs.org in a
>     single download, be able to tell which download you need
>
> Dependencies: we have too many, and many of them are on pretty unusual
> things. Most of them have sub-dependencies on even more unusual things
> (pyutil, setuptools_trial). I'd like to use viz or dot or one of those
> graph-visualization tools to show the whole thing at once: it's kind of
> frightening, especially compared to the value we're getting out of some
> of them. There are advantages to moving from "write it yourself" to "use
> somebody else's library", and there are advantages to moving from
> "include their library in your source tree" to "ask the user to install
> that library first" to "try to auto-install that library", but there are
> also drawbacks. Each additional dependency puts more pressure on our use
> of setuptools, makes life more difficult for OS packagers, and makes it
> harder to get Tahoe running.
>
> Some, like Twisted, are pretty big and pretty central to our
> architecture (so we couldn't feasibly write-it-ourselves or include it
> in our source tree). But others, like the 8kB "mock" library (used only
> for unit tests), could just be inlined, or write-it-ourselves'ed.
>
> Here's a full list of the dependencies I was able to track down, with
> the primary ones (i.e. what Tahoe itself cares about) in the first
> column, and all secondary deps indented. Some secondary deps are
> referenced by multiple packages.
>
>  twisted
>  -zope.interface
>  nevow
>  foolscap
>  -pyOpenSSL
>  simplejson (stdlib in py2.6)
>  sqlite3 (stdlib in py2.5)
>  zfec
>  -setuptools
>  -darcsver
>  -setuptools_darcs
>  -argparse
>  -pyutil
>   -setuptools_trial
>   -simplejson
>   -argparse
>   -zbase32
>  pycryptopp (for AES, RSA, SHA256[stdlib in py2.5])
>  -setuptools, darcsver, setuptools_darcs
>  -setuptools_pyflakes, stdeb
>  pyasn1 (for twisted.conch/SFTP)
>  pycrypto (for twisted.conch/SFTP)
>
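
The dependency list above can be fed to dot fairly directly. A minimal
sketch, assuming the "-" prefixes and indentation in the list are read
as "depends on" edges. That reading, the "tahoe" root node, and the
helper names to_dot and all_packages are illustrative assumptions, not
something taken from the unsuck branch:

```python
# Edges transcribed from the dependency list above (one reading of it).
DEPENDENCIES = {
    "tahoe": ["twisted", "nevow", "foolscap", "simplejson", "sqlite3",
              "zfec", "pycryptopp", "pyasn1", "pycrypto"],
    "twisted": ["zope.interface"],
    "foolscap": ["pyOpenSSL"],
    "zfec": ["setuptools", "darcsver", "setuptools_darcs", "argparse",
             "pyutil"],
    "pyutil": ["setuptools_trial", "simplejson", "argparse", "zbase32"],
    "pycryptopp": ["setuptools", "darcsver", "setuptools_darcs",
                   "setuptools_pyflakes", "stdeb"],
}


def to_dot(graph, name="deps"):
    """Render a {package: [dependencies]} mapping as Graphviz DOT text."""
    lines = ["digraph %s {" % name]
    for parent in graph:
        for child in graph[parent]:
            lines.append('  "%s" -> "%s";' % (parent, child))
    lines.append("}")
    return "\n".join(lines)


def all_packages(graph, root):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    stack = [root]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


if __name__ == "__main__":
    print(to_dot(DEPENDENCIES))
```

Saving that output as deps.dot and running "dot -Tpng deps.dot -o
deps.png" should render the whole graph, assuming Graphviz is
installed.
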
> Given that Tahoe only uses 4 files from zfec (about 30 lines of python
> and 1300 lines of C), it'd be nice if we didn't need to reference all 10
> of its sub-dependencies. Likewise it might be nice to find less
> dependency-heavy ways to get at AES and RSA (and use SHA256 from
> hashlib). I like pycryptopp better than pycrypto, but if we've committed
> to providing SFTP out-of-the-box, then maybe we should consider getting
> our AES and RSA from pycrypto and dropping one set of dependencies. Or
> maybe we should not commit to SFTP out-of-the-box, and build some sort
> of plugin scheme: maybe we could reduce the size of tahoe-core and
> reduce the dependency burden on folks who don't care about SFTP.
>
>
> = SOLUTIONS =
>
> My first step has been to set up a Git mirror of the tahoe tree, which
> has solved many of my problems with Darcs. It also makes it possible to
> easily publish a branch on which I'm experimenting with approaches to
> solve some of the other problems, available here:
>
>  http://github.com/warner/tahoe-lafs/tree/unsuck
>
> Here's what I'm trying out in that branch:
>
>  - rip out any mention of setuptools and pkg_resources
>  - provide a "setup.py check_deps" command which, in a subprocess,
>   attempts to import each known dependency, obtain its version with
>   e.g. "foolscap.__version__", and compare it against a known minimum
>  - provide "setup.py build_deps" which looks for tarballs of
>   pre-determined names in a local directory (i.e. ../tahoe-deps/),
>   unpacks them in a new subdir, and runs their "setup.py install" (with
>   --single-version-externally-managed where necessary) to make them
>   available in support/lib/pythonX.Y/site-packages
>    - the build is performed by os.execv(sys.executable, "setup.py",
>      "install"), so it always uses the same version of python as was
>      used to run the top-level "setup.py build_deps", and does not
>      search PATH
>    - each build is done in a separate process, so the results of one
>      build are available for import by the next
>    - builds are only done when the check_deps import fails or gets an
>      old version
>  - replace the setuptools+entrypoints -generated bin/tahoe with a script
>   like the original tahoe-0.2.0 (june-2007) bin/tahoe, which adds
>   TOP/support/lib/pythonX.Y/site-packages to the import path and then
>   does a simple 'from allmydata.scripts.runner import run; run()'
>    - the new bin/tahoe modifies PYTHONPATH and re-execs itself. This
>      allows eggs in the support/ dir to be processed, and ensures that
>      subprocesses can do the same. The original one modified sys.path
>      instead, and was hard to get working right for subprocesses
>  - stop exec()ing standalone trial and twistd (for 'setup.py test' and
>   'tahoe start', respectively). Instead, import those modules from
>   Twisted and invoke their functions from within python. This removes
>   the need to duplicate the shell's "which" builtin, and removes some
>   of the difficulties on windows where you might be executing trial, or
>   trial.py, or trial.bat, or trial.exe, or something.
>    - this also removes a subprocess from the 'tahoe start' path, so that
>      problems during tahoe.cfg loading or library importing are reported
>      in the parent process, rather than being buried in the logs and a
>      parent which emits "node probably started".
>    - the new rule is: never exec() something that didn't come with Tahoe
>      or with Python, our two fundamental dependencies
>    - this also removes a lot of noise and slowdowns from "setup.py test"
>  - relatedly, make "setup.py test" delegate to twisted.scripts.trial,
>   basically by inlining some of the contents of setuptools_trial but
>   removing a lot of the wrapper layers. The fact that distutils
>   commands cannot just pass all of sys.argv into the child is a drag,
>   as it means you must create new distutils options for every feature
>   of the underlying tool you wish to expose. Also simplified things by
>   allowing trial.run() to do its own sys.exit(), rather than trying to
>   call into an internal function to avoid that.
>    - this means that e.g. 'python setup.py test othercommand' won't
>      work: it will never run othercommand. But it's much simpler and
>      more robust this way.
>  - move all version-measurement code out of __init__.py and into
>   util/versionutil.py . I can't stand having lots of code in
>   __init__.py, especially code which spawns subprocesses during
>   'import'. Also I made the output of "tahoe --version" more legible
>   (one package per line), and I'm planning to change it to only emit
>   tahoe's version unless you add a --all flag.
>  - generate src/allmydata/_version.py (which contains a string, not a
>   pyutil_Version or distutils.version.LooseVersion instance) by
>   checking git metadata, which can be run in O(1) time.
>  - copy mock.py into src/allmydata/tests/ to remove it as a dependency.
>   If/when a newer version comes out, it's easy to copy that version
>   into place.
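
The check_deps idea in the list above (import each dependency in a
subprocess, read its __version__, compare against a minimum) could
look roughly like the following. This is a sketch in present-day
Python 3, not the code from the branch; the minimum versions and the
helper names are made up for illustration:

```python
import subprocess
import sys

# Hypothetical minimums, for illustration only; not Tahoe's real ones.
MINIMUM_VERSIONS = {
    "foolscap": (0, 5, 1),
    "twisted": (2, 4, 0),
}

# Runs in the child: import the module named in argv[1], print its version.
PROBE = (
    "import importlib, sys\n"
    "module = importlib.import_module(sys.argv[1])\n"
    "print(getattr(module, '__version__', ''))\n"
)


def parse_version(text):
    """Turn '1.2.3rc1' into (1, 2, 3) using the leading digits of each
    dotted piece, stopping at the first piece that has none."""
    numbers = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def probe_version(module_name):
    """Import module_name in a child interpreter. Return its __version__
    as text ('' if it has none), or None if the import failed."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE, module_name],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def check_deps(minimums):
    """Return {name: reason} for every dependency that is unusable."""
    problems = {}
    for name, minimum in minimums.items():
        found = probe_version(name)
        if found is None:
            problems[name] = "not importable"
        elif not found:
            problems[name] = "no __version__ attribute"
        elif parse_version(found) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems[name] = "found %s, need >= %s" % (found, wanted)
    return problems


if __name__ == "__main__":
    for name, reason in sorted(check_deps(MINIMUM_VERSIONS).items()):
        print("%s: %s" % (name, reason))
```

Using sys.executable for the child keeps the probe on the same
interpreter that ran the check, which matches the "does not search
PATH" goal above.
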
>
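
The PYTHONPATH-and-re-exec launcher described in the list above can be
sketched like this. It assumes the TOP/support/lib/pythonX.Y/
site-packages layout named in the email; the function names are
hypothetical and this is not the actual bin/tahoe from the branch:

```python
import os
import sys


def support_site_packages(top):
    """TOP/support/lib/pythonX.Y/site-packages for this interpreter."""
    pyver = "python%d.%d" % sys.version_info[:2]
    return os.path.join(top, "support", "lib", pyver, "site-packages")


def path_entries(environ):
    """Non-empty PYTHONPATH entries from an environment mapping."""
    raw = environ.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


def needs_reexec(top, environ):
    """True until the support directory is listed in PYTHONPATH."""
    return support_site_packages(top) not in path_entries(environ)


def environment_with_support(top, environ):
    """Copy environ, putting the support directory first in PYTHONPATH."""
    env = dict(environ)
    entries = path_entries(environ)
    supportdir = support_site_packages(top)
    if supportdir not in entries:
        entries.insert(0, supportdir)
    env["PYTHONPATH"] = os.pathsep.join(entries)
    return env


def main(top, argv, environ):
    if needs_reexec(top, environ):
        # Replace this process with the same interpreter and arguments,
        # so child processes inherit the adjusted PYTHONPATH as well.
        env = environment_with_support(top, environ)
        os.execve(sys.executable, [sys.executable] + list(argv), env)
    # Second pass: the support directory is importable now. The real
    # script would continue with something like
    #   from allmydata.scripts.runner import run; run()
```

Because the check looks only at PYTHONPATH, the second pass sees the
support directory already present and does not exec again.
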
> The next directions I'm likely to go are:
>
>  - add a flag to 'setup.py build_deps' that will allow downloads of
>   needed dependency tarballs that are not found locally. These tarballs
>   will be identified by hash and fetched with urllib, so no external
>   tools are necessary, and nothing is added to the reliance set
>   (network-side attackers cannot inject their own code into your
>   dependencies).
>  - add another flag to allow dependencies to be satisfied by
>   pre-compiled eggs, still identified by hash, probably hosted on
>   tahoe-lafs.org . This should let the build-deps process work on
>   systems without a compiler. Ideally, the eggs would be dropped into
>   support/lib/pythonX.Y/site-packages/ where the PYTHONPATH-based
>   lookup could find them, but I don't know if this would work without
>   the easy-install.pth files. Perhaps the PYTHONPATH-setting logic
>   could look in support/ for .egg files and add them directly to
>   PYTHONPATH, but again I'm not sure that would be enough.
>    - worst case, add a flag to offer to do "easy_install FOO" and put
>      the result in support/ , giving up on safety against trojan
>      downloads.
>  - figure out some SUMO.tar -building automation, since the new scheme
>   recognizes more dependencies than are currently in our SUMO tarball
>   (e.g. mock, pyasn1, newer versions of many)
>  - maybe build a single-file "check my system" script, and suggest that
>   first-time users who are unsure of which tarballs to grab could start
>   by running that script, which would check for all dependencies and
>   then tell the user to either download the SUMO tarball or the smaller
>   one. If we make sure that SUMO is comprehensive, it may get bigger,
>   and it'd be nice to avoid a large download when it's not necessary
>  - find ways to reduce dependencies
>    - see about removing Nevow in favor of manually-generated HTML pages.
>      We really aren't using most of Nevow's features.
>    - require python2.6 and use stdlib json/sqlite instead of 3rd-party
>      dependencies
>    - do something about the dependency load of zfec/pycryptopp: maybe
>      embed them, maybe talk zooko into improving them, maybe come up
>      with replacements, not sure
>  - improve test automation: build an schroot/debroot -based set of VMs
>   with various debian/ubuntu releases that we care about (maybe on the
>   newly-donated hardware), with well-defined package installs (e.g.
>   both with and without foolscap), and add them to the buildbot. Maybe
>   create them on-demand on EC2 instances.
>  - improve debian packaging: start making tahoe .debs again. Make sure
>   there are .debs for all our dependent packages so the tahoe.deb can
>   actually be installed. Build test automation to assert this.
>  - handle version numbers as something passed into each command, rather
>   than being embedded in the source code. In particular, running
>   "setup.py" or "./bin/tahoe" should compute a version number and make
>   it available to all descendant processes. Running an installed
>   /usr/bin/tahoe should use an embedded version. This is a direction I
>   want to try on a number of build systems, as I think it may make more
>   sense overall.
>  - rip out any notion of "appname", which I think was a mistake that
>   resulted from darcs version "numbers" being insufficient to capture
>   branch information
>
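
The "identified by hash and fetched with urllib" step from the list
above might be sketched as follows, in present-day Python 3. The email
does not say which hash would be used, so SHA-256 here is an
assumption, as are the function names; the opener parameter exists only
so the check can be exercised without a network:

```python
import hashlib
import urllib.request


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def verify_tarball(data, expected_sha256):
    """Return data unchanged, or raise ValueError on a hash mismatch."""
    actual = sha256_hex(data)
    if actual != expected_sha256:
        raise ValueError(
            "hash mismatch: expected %s, got %s" % (expected_sha256, actual)
        )
    return data


def fetch_verified(url, expected_sha256, opener=urllib.request.urlopen):
    """Download url and hand back its bytes only if the hash matches.

    Nothing is unpacked or executed before verification, so whoever
    serves the bytes cannot substitute different code.
    """
    with opener(url) as response:
        data = response.read()
    return verify_tarball(data, expected_sha256)
```

The expected hashes would have to ship inside the Tahoe source tree
itself for this to add nothing to the reliance set; a hash fetched
from the same server as the tarball would not help.
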
>
> Anyways, that's what I've been playing with in the last few weeks.
> Please let me know what you think, and take a look at that branch. I'm
> sure this will stir up some strong feelings: I'm eager to hear what
> people feel about this kind of approach.
>
> I've not tried this on windows yet, and I'm sure something will be
> broken there, but I believe that it should be possible to achieve both
> simplicity and works-on-windows, just as I'm sure it should be possible
> to get something that is easy to use on sensible systems (e.g. debian
> with all dependencies except for tahoe installed) and comfortable to use
> on less-fully-featured systems too.
>
> cheers,
>  -Brian


