[tahoe-dev] Karmic packaging updates

Brian Warner warner at lothar.com
Tue Aug 25 03:58:56 UTC 2009

Karmic packages have been uploaded to REVU for review. They seem to be
stuck somewhere, though, and haven't shown up on the main page. Once
that's resolved, the next step is to add comments to them, and solicit
reviews. (Given the large number of packages already in REVU, some of
which are in much better shape than tahoe and have been there for
months, I can't say I'm optimistic about it being accepted before the
Wednesday freeze. But at least we've started the process.)

For reference, the full debian build process works like this:

 * start from an "upstream tarball" and an unpacked+patched source
   directory (with the new debian/* files, and any necessary source
   packages). Run "debuild" in this directory, which does everything
   listed below:
 * "debian/rules clean", to remove anything you've left over from
   earlier packaging attempts. The default rules use "setup.py clean".
 * create the "debian .diff.gz": unpack the upstream tarball to a
   separate directory, "diff -r" between that upstream dir and the
   current dir, compress and save the results as PACKAGE-VER.diff.gz
 * hash the upstream tarball and .diff.gz, write that into a gpg-signed
   ".dsc" file, to securely describe the "source package"
 * "debian/rules build", to compile the package (for whatever that
   means; in a python package, it usually means 'setup.py build', but
   for a pure-python project, that doesn't really do anything important)
 * "debian/rules binary", to install the package into a working
   directory (i.e. "setup.py install --root=LOCALDIR"), then turn that
   into a .deb package
 * hash everything (.orig.tar.gz, .diff.gz, .dsc, and new .deb files)
   into the gpg-signed ".changes" file, which serves as a handle for
   upload processing (uploads are only accepted when signed by the
   appropriate debian developer)
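
The two hashing steps above can be sketched roughly as follows. This is
only an illustration in Python, assuming SHA-256 and a simplified
"digest size filename" line format; the helper name checksum_lines is
hypothetical, and the real .dsc and .changes files carry more fields and
are gpg-signed by the packaging tools.

```python
import hashlib
import os


def checksum_lines(paths):
    """Return one 'digest size basename' line per file (simplified format)."""
    lines = []
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        # Record the digest, the size in bytes, and the bare filename.
        lines.append("%s %d %s" % (digest, len(data), os.path.basename(path)))
    return lines
```

Anyone who later fetches the listed files can recompute the digests and
compare them against the signed list.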

Zooko Wilcox-O'Hearn wrote:
> Could you be more specific?  What went wrong before you ripped out  
> most of the custom stuff in our setup.py files?  Is it a bug in our  
> custom setup.py stuff or should we just treat Debian packaging as a  
> sufficiently different use case that the setup.py shouldn't be  
> expected to handle it?

Sure thing.. here are some details.

 * invoking setup.py, even to do "setup.py clean", creates files (it
   downloads and builds the setup_requires= packages, like darcsver,
   setuptools_darcs, etc, and creates .egg files in the top-level
   directory). These show up in the debian .diff, making it appear as if
   the creation of these eggs is a debian-specific requirement. I had
   to change the rules to use the Makefile's "make clean" target instead
   of "setup.py clean". Since "make clean" also invokes "setup.py
   clean", the result is that we first download/build darcsver/etc, then
   delete them, then compute the diff.

 * sometimes, invoking setup.py tries to download a file (darcsver or
   pyutil or argparse or whatever). Sometimes this is because setuptools
   is confused about the download sources, sometimes it's because the
   dependencies have been updated but a new .egg/.tar.gz hasn't been
   embedded in the source tree. This is verboten, as it would cause a
   Fails-To-Build-From-Source error for someone who retrieved the debian
   source package but was unable to fetch anything else at build time.
   (debian packages are *always* supposed to be desert-island builds).
   To enforce this, the debian/ubuntu build daemons (known as "buildd"s)
   are not allowed outside network access.

 * "setup.py build" was doing too much work. For builds from released
   tarballs, there's no need to install and run darcsver (since the
   version number is already baked in), and byte-compilation is not
   necessary (since pysupport/pycentral handle that at installation
   time, for whichever versions of python are present on the target
   system.. they pass --no-compile to "setup.py build").

 * Tahoe's "setup.py build" was doing *way* too much work: because
   setup.cfg aliases "build" to "develop" (and more), the debian package
   build process was trying to download and compile pycryptopp too (and
   foolscap, and zfec, and everything that tahoe wants for runtime but
   which the debian package didn't declare as a build-time dependency).
   When we really didn't need any of that: this step only needs
   "setup.py install" to copy the .py files into the right place.

 * most of the extra bells and whistles in these setup.py files are used
   to improve behavior when building on a system that doesn't have the
   dependencies already installed (where setuptools is the "first
   responder" with enough intelligence to attempt to provide these
   dependencies). For a debian build, the debian/control and .dsc files
   clearly specify and enforce the dependencies, so the setup.py doesn't
   need to be so smart. In addition, building from a released tarball
   doesn't need darcsver. The build process got much easier when I
   removed those setup_requires= entries from setup.py.

 * incidentally, to remove zfec's dependency on the unpackaged
   pyutil/argparse libraries, I had to delete a bunch of files: easyfec,
   filefec, and the tests which referenced them. I'll reiterate my
   earlier statement: the overhead of depending upon a small library can
   easily overwhelm the benefit of using it, especially in the context
   of a well-established (and therefore harder-to-get-stuff-added-to)
   packaging system like debian/ubuntu. The amount of work is roughly
   proportional to the number of packages, regardless of their size. I
   think that zfec was only using like three tiny functions.. if they
   were instead copied into zfec, then the debian package could include
   filefec/easyfec. As it stands, you'll have to convince debian/ubuntu
   to accept a "pyutil" package too.
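
The "desert-island build" rule from the list above can be approximated
locally with a small guard. This is a sketch under assumptions: the
buildds restrict network access at the system level, whereas this
hypothetical no_network() helper only catches connections made by Python
code in the same process (it would not see a subprocess).

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Make any in-process attempt to open a connection fail loudly."""
    def refuse(self, *args, **kwargs):
        raise RuntimeError("network access attempted during build")

    original = socket.socket.connect
    socket.socket.connect = refuse
    try:
        yield
    finally:
        # Restore normal behaviour once the guarded block is done.
        socket.socket.connect = original
```

Running a build step inside "with no_network():" turns a silent download
into an immediate error, which is roughly what a
Fails-To-Build-From-Source report would tell you later.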

>> in particular I'd like to make "setup.py build" go back to normal and
>> document some other command (perhaps "setup.py build_tahoe"?) that
>> users should run after they unpack the source tree and before they
>> invoke ./bin/tahoe for the first time.
> I'm not sure exactly what the issues you experienced were, but I've  
> seen problems in the past due to our desire to have a command,  
> currently "python ./setup.py build", which makes a "bin/tahoe"  
> executable that can "run from source" without being installed and  
> without having to re-run build after every change to the source.   

The issue here was that the debian/python packaging tools (in particular
pysupport/pycentral) expect "setup.py build" and "setup.py install" to
do some fairly specific things:

 * build: compile any C/C++ extensions
 * install: copy .py/.so into --root directory

(for python, the split between the two is fuzzy, at best. Usually I
expect a "build" step to compile things but not touch anything outside
the source tree, and to run fine as a non-root user. I expect the
"install" step to copy things into the target directory [outside the
source tree], to not compile anything, and to require running with
enough privileges to modify the target directory. I usually run 'make'
and then 'sudo make install', and expect to have no root-owned files in
the source tree. I think the debian process has similar expectations,
given that "debian/rules build" is run as non-root, whereas
"debian/rules binary" gets run under fakeroot)

For the benefit of tahoe users (as opposed to debian packagers), we've
provided a way to do more than that:

 * evaluate the need for dependencies, download them, build them, stuff
   them into the source tree (in support/lib)
 * perform magic to get live tahoe source code in an easily importable
   place (also support/lib, via "allmydata-tahoe.egg-link")
 * generate entry-point scripts, put them in a known location
 * make sure that this bin/tahoe is usable from anywhere (via symlinks),
   as long as the source tree remains in place

Now, I really like the availability of this feature: it reduces the
get-things-started instructions to two steps:

 * type "make"
 * type "bin/tahoe --version"

It might be useful to have a brief history lesson. When Zooko moved us
away from Makefiles and closer to setuptools commands, that "make"
invocation turned into "python setup.py build". This removed one use of
the Makefile (which, I think, Zooko cared about, and which I didn't care
about, because I use Makefiles for this purpose all the time, whereas I
think he thought it weird to have a Makefile in a python project, kind
of like finding a crowbar in your pencil drawer), but removed an
abstraction layer. Instead of being able to pile arbitrary complexity
behind a single "make" command, any of our changes had to either go into
a customized setup.py (overriding the 'build' class), or into a
setup.cfg (overriding the 'build' command, and referencing other
commands which were either implemented in setup.py or in external
modules like darcsver).

The Makefile rules that created the initial _version.py (or, later,
recreated it with every command) were replaced by setup.cfg aliases to
transform the "build" command into a "darcsver build" duo, because then
the instructions could continue to say "setup.py build" instead of
requiring "setup.py darcsver build". This grew over time, until finally,
to get all the dependencies built, and a source tree set up, and to
generate bin/tahoe, the alias became "build = darcsver develop
make_executable build".
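
To make the alias mechanism concrete, here is a simplified model of the
expansion, not setuptools' actual code. It assumes each alias is
expanded at most once, so that an alias which mentions its own name
(like "build" here) does not recurse forever; the function name
expand_aliases is hypothetical.

```python
import shlex


def expand_aliases(argv, aliases):
    """Expand setup.cfg-style aliases in a list of command words."""
    remaining = dict(aliases)  # copy, so each alias can be used up once
    pending = list(argv)
    expanded = []
    while pending:
        word = pending.pop(0)
        if word in remaining:
            # Replace the alias with its definition and forget the alias,
            # so a trailing "build" is treated as the real command.
            pending = shlex.split(remaining.pop(word)) + pending
        else:
            expanded.append(word)
    return expanded
```

With aliases = {"build": "darcsver develop make_executable build"},
expand_aliases(["build"], aliases) yields the four commands in order,
ending with the real "build".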

To get the "run $SOURCE/bin/tahoe from anywhere" behavior, bin/tahoe has
code to find itself (i.e. compute $SOURCE), find the tahoe source (i.e.
$SOURCE/src, where the 'allmydata' module lives), add it to sys.path,
then invoke the normal CLI frontend in allmydata.scripts.runner . This
code started out as a way to compute the right build/lib directory,
since that's the one populated by 'setup.py build', and once upon a
time the python version number got embedded in the path somewhere. When
automatically-built dependencies showed up, that sys.path manipulation
also added $SOURCE/support/lib/pythonX.Y/site-packages . (at this point,
we could probably rely upon the latter and stop using the former, since
we've got that .egg-link file anyways).
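
The path computation described above might look something like this
sketch. It is not the real bin/tahoe code: the function name
source_tree_paths is hypothetical, and it assumes the script lives in
$SOURCE/bin/ and that the two directories named above are the only ones
added.

```python
import os


def source_tree_paths(script_path, python_version):
    """Return the directories a bin/tahoe-style script adds to sys.path."""
    # The script lives in $SOURCE/bin/, so $SOURCE is two levels up.
    source = os.path.dirname(os.path.dirname(os.path.abspath(script_path)))
    major, minor = python_version
    return [
        # Live tahoe source, where the 'allmydata' module lives.
        os.path.join(source, "src"),
        # Automatically built dependencies.
        os.path.join(source, "support", "lib",
                     "python%d.%d" % (major, minor), "site-packages"),
    ]
```

A script would then do something like
"sys.path[:0] = source_tree_paths(__file__, sys.version_info[:2])"
before importing allmydata.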

To support windows (and users who run setup.py with a python that
differs from what "/usr/bin/env python" would find, like somebody who's
running python from its source tree, such that bin/tahoe's shebang line
needs to match), the rules were changed to generate bin/tahoe (instead
of having it as a source file), by modifying a template file. The
'make_executable' command probably does that.

If tahoe is actually installed (and the sys.path/$PATH -setting code
isn't necessary), a setuptools "entry-point" script can be used, like
the one that 'setup.py develop' creates in support/bin/tahoe .
('setup.py install' also writes this script, so when the debian package
is created, it contains the entrypoint script instead of the original
bin/tahoe). These entry-point scripts perform additional runtime
checking on the version of tahoe and its dependencies, which sometimes
doesn't work (on older versions of debian/ubuntu, where things like
simplejson have the wrong .egg-info data). The entry-point script
consists (mostly) of a single line:


My biggest beef with entrypoint scripts is that they don't give the
end-user (who reads the contents of /usr/bin/tahoe to learn more about
it) *any* clue as to where the source code is coming from. The
load_entry_point function itself must hunt through sys.path looking for
.egg-info files/directories with a matching package name, and then read
some piece of metadata (*.egg-info/entry_points.txt) to look up
[console_scripts]tahoe, which then points to an importable module and
function name. No amount of grepping for a file named
'*allmydata-tahoe*' will find that (since it's actually named
allmydata_tahoe.egg-info, with an underscore). I vastly preferred the
original bin/tahoe script, which ended with an easy-to-research:

 from allmydata.scripts import runner; runner.run()
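
For comparison, the last hop of the entry-point lookup (from
entry_points.txt to a module and function name) can be sketched as
below. This is not the pkg_resources implementation; the helper name
resolve_console_script is hypothetical, and the sample target
"allmydata.scripts.runner:run" is an assumption based on the script
line above.

```python
import configparser


def resolve_console_script(entry_points_text, script_name):
    """Return (module, function) for a [console_scripts] entry."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(entry_points_text)
    target = parser.get("console_scripts", script_name)
    # Entries have the form "package.module:function".
    module_name, function_name = target.split(":", 1)
    return module_name.strip(), function_name.strip()
```

Given the text "[console_scripts]" followed by
"tahoe = allmydata.scripts.runner:run", this returns the module and
function that the generated script eventually calls, which is the
information the script itself never shows.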

Finally, all 8 major setup.py commands (build, test, sdist, install,
bdist_egg, trial, sdist_dsc, test_mac_diskimage) have "darcsver
--count-all-patches" prefixes in front of them in the setup.cfg
[aliases] section. This frustrates me, because darcsver (and darcs in
general) is very disk-io intensive and takes 10 seconds to run on my
FileVault'ed OS-X box (with a cold cache), probably because it has to
read all 4000+ patches to figure out which ones are tags. Having
darcsver in the aliases means it's run on every single setup.py command,
making this slowdown more painful. This is why I added "make quicktest"
to invoke the tests without setup.py involvement.

I believe that Zooko added a 'darcsver' prefix to those commands to
ensure that _version.py would be created before anything which might try
to read it. The original Makefile had a rule for this, but it only got
run once (when _version.py didn't exist). This avoided a lot of
time-consuming darcsvering, at the cost of not getting updated when
you'd pulled more patches. For a while we had the rule configured to run
all the time (except for my 'quicktest' target).

Ok, well, it looks like that so-called history lesson was actually a
thinly-veiled rant about the state of our packaging system. Oops :-).

So, what I'd like is for us to continue to have a command that fits into
the first line of our instruction manual: the one that used to be "make"
and is now "setup.py build" (but really maps to "setup.py darcsver
--count-all-patches develop --prefix=support make_executable build").
But I'd also like for the debian packaging tools to be able to have
"setup.py build" and "setup.py install" work as they expect. I'd like to
not re-run darcsver on every setup.py command, or I'd like darcsver to
run in constant time with one disk read. I'd like to be able to compute
a version string without downloading stuff at build time.

I think the 'develop' command does most of what we want, with the
dependent-library-building and tahoe-egg-link'ing: this gives us a fixed
place to add to sys.path that will handle live tahoe code (i.e. you
don't have to re-run it each time the code is changed) as well as any
dependent libraries. I suspect that some of our
find-yourself-and-update-sys.path code is redundant; we should do a pass
to see what we're doing to support bin/tahoe in general, what we're
doing to support windows, what we're doing to support unit tests, what
we're doing to support the fact that "bin/tahoe start" now uses an
intermediate process (which I think should go away), and how some of it
might be cleaned up.

I think that it might be reasonable to have code in setup.py that
creates _version.py if it does not already exist, rather than having
every command get aliased to include an invocation of "darcsver". It'd
be even nicer if there were a cheap way to determine when darcsver needs
to be re-run, but we may have to switch to Git or something else to get
O(1) version numbers with a one-liner instead of a whole extra module.
(incidentally, "git describe HEAD" produces just this, and runs in less
than a second with a cold cache, and in 10ms with a hot cache).
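
A minimal sketch of the "create _version.py only if it is missing" idea,
under assumptions: the helper name ensure_version_file is hypothetical,
the expensive lookup (darcsver, "git describe", or anything else) is
passed in as a callable, and the one-line file format shown here is a
stand-in rather than the format darcsver actually writes.

```python
import os


def ensure_version_file(path, compute_version):
    """Write a _version.py-style file only if it does not exist yet.

    compute_version is called only when the file must be created, so the
    expensive lookup is skipped on every later run. Returns True if the
    file was written.
    """
    if os.path.exists(path):
        return False
    version = compute_version()
    with open(path, "w") as f:
        f.write("__version__ = %r\n" % (version,))
    return True
```

The trade-off is the same one the old Makefile rule had: the file goes
stale after pulling more patches unless something deletes it.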

It may be reasonable to define a new command for users to run, instead
of "build" or "develop" (since both have established meanings and
expected behaviors). What if the instructions were:

 * type "python setup.py build-tahoe"
 * type "python bin/tahoe --version"

The admonition to use "python" when running bin/tahoe would put a slight
burden on non-installed-python users, while removing the need for the
template-based script and the 'make_executable' step. (normal users who
use their installed python could rely upon the #!/usr/bin/env python and
just type "bin/tahoe").

The "build-tahoe" command would do darcsver+develop, and could even
finish by invoking "bin/tahoe --version" to make sure it works.

> On the other hand, maybe it is okay if the Debian diff contains a big
> patch that simplifies and re-arranges the build system just for the
> purpose of building .debs. Does that cause any harm?

Not really. It probably raises some eyebrows among the debian packagers,
wondering why the tahoe build process is so weird that they/we/I have to
throw out most of it. Usually the debian .diff.gz only modifies the
upstream sources (outside of the debian/* files) to fix important bugs
or get the package to play nicely with debian-specific layout policies.
The general hope is that everything outside of debian/* has been
forwarded for inclusion upstream, and for the .diff.gz to get smaller
over time.

> 2.  That you can build a .egg with "python ./setup.py bdist_egg" and  
> then install it with "easy_install $EGG" and this results in a  
> working install.  This gets tested for Tahoe-LAFS, zfec, and  
> pycryptopp on all of the aforementioned platforms and they all pass.

Is this the "install-egg" buildstep? The one that's red on five
"Supported" builders right now? I've never really understood that test.

> 5.  That if you build and then run "make clean", that it doesn't  
> leave cruft behind.  This gets tested for Tahoe-LAFS on one Linux  
> builder.  It passes.

I've seen that fail on OS-X, where it downloads and builds things like
pycryptopp unnecessarily (the bug where different setup.py commands put
.eggs in different places, some of them different than where "make
clean" is prepared to find them). Also, it tests "make clean" instead of
"setup.py clean", which is why the debian rules had to be customized.

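A cruft check of the kind described in item 5 could be sketched like
this, assuming it is enough to compare the set of paths in the tree
before the build and after the clean. The helper names snapshot and
leftover_cruft are hypothetical, and the buildbot step may well work
differently.

```python
import os


def snapshot(root):
    """Return the set of file and directory paths under root (relative)."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def leftover_cruft(before, after):
    """Return the sorted paths that exist after cleaning but not before."""
    return sorted(after - before)
```

Taking one snapshot on a pristine tree and another after
"build; clean" makes stray .egg files show up by name, wherever a given
setup.py command happened to put them.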
