mini-Summit report, day 2

Callme Whatiwant nejucomo at gmail.com
Fri Jul 11 22:55:53 UTC 2014


Inline responses to a few different points below:

On Wed, Jul 2, 2014 at 1:05 PM, Brian Warner <warner at lothar.com> wrote:
>
> Summary of the second day. Brian, Daira, and Nathan met in a coffee
> shop, and we:
>
> * built a branch that replaces Tahoe's build process with the "create a
>   virtualenv, peep-install everything into it, make bin/tahoe run
>   venv/bin/tahoe" technique that Brian used in Petmail. Modulo
>   bugs/limitations in Peep, this is "safe", in that it checks
>   pre-specified hashes on all tarballs before running any code from
>   them. If you managed to get a "good" copy of tahoe's source code, this
>   would ensure that you only use "good" code for tahoe's dependencies.
>
>   We actually wrote two separate versions of this (racing against each
>   other, dueling-laptops style). My (Brian's) version is published at
>   https://github.com/warner/tahoe-lafs/tree/venv , and also strips out a
>   lot of code that was annoying me (version string management, almost
>   everything in __init__.py and setup.py, setup.cfg, _auto_deps.py).
>
>   (basically, tahoe's current "install stuff into support/" trick was
>   something I came up with 7 years ago before virtualenv was a thing,
>   and these days we should just use virtualenv. It doesn't even require
>   that you have setuptools installed first, since it brings its own copy
>   of virtualenv, which includes a copy of setuptools and pip)
>
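
A minimal sketch of that bootstrap sequence, assuming a vendored copy
of peep.py and a requirements.txt that pins a sha256 hash for every
tarball (file names and paths here are just illustrative, and it's
Unix-only for brevity):

    #!/usr/bin/env python
    # Sketch of the venv bootstrap: make a virtualenv, peep-install
    # the hash-pinned dependencies into it, then point bin/tahoe at
    # the venv's entry point.  Assumes virtualenv is on PATH and that
    # peep.py and requirements.txt sit next to this script.
    import os
    import subprocess

    VENV = "venv"

    def run(*cmd):
        print("running:", " ".join(cmd))
        subprocess.check_call(cmd)

    if not os.path.isdir(VENV):
        # virtualenv bundles its own pip and setuptools, so nothing
        # is needed on the host beyond python + virtualenv itself
        run("virtualenv", VENV)

    # peep checks each tarball against the sha256 pre-declared in
    # requirements.txt before any code from it runs
    run(os.path.join(VENV, "bin", "python"), "peep.py",
        "install", "-r", "requirements.txt")

    # make bin/tahoe delegate to venv/bin/tahoe
    if not os.path.isdir("bin"):
        os.makedirs("bin")
    run("ln", "-sf", os.path.join("..", VENV, "bin", "tahoe"),
        os.path.join("bin", "tahoe"))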
>
> * discussed what an ideal green-field programming language would use for
>   dependency management. The actual runtime code would identify imported
>   modules by hashes, and modules would be deep-frozen, so importing the
>   module should be indistinguishable from interpolating the source code
>   of the module. Developers would have one phase where they select their
>   dependencies (using local petnames for convenience, but their
>   development environment knows how to map those petnames to specific
>   module hashes), then a separate step where the code gets compiled or
>   translated into non-petname hash-based modules, for execution.
>
>   Module authors (and others) would publish signed "mod XYZ would be a
>   good replacement for mod ABC" links (edges in a graph, where the nodes
>   are module hashes) to indicate newer versions or forks, but these
>   would not be followed automatically. Ideally the developer would look
>   at the changes (or the recommendations) to make decisions, and for
>   not-entirely-compatible changes, the edges would include information
>   about how you need to update your calling code to match. This could
>   guide a tool (like python's 2to3) that searches through your codebase
>   for things that need changing. Imagine a screen that says "to update
>   from dependency v1 to v2, you must change the following 4 call
>   sites".
>
>   The runtime tool should accept overrides from the ops-folks/admins to
>   say "I know you wanted hash 123, but you should accept hash 456
>   instead", to deploy security updates faster than the upstream author
>   can change their locally-declared dependencies. In general, all
>   modules declare their dependencies with strong references (hashes).
>
>   (Brian did some work in this space back in the Jetpack days:
>   http://people.mozilla.org/~bwarner/jetpack/components/ )
>
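
To make the hash-based import idea concrete, here is a toy sketch
(every name in it is made up for illustration): modules live in a
content-addressed store, petnames exist only on the developer's side,
and the runtime consults an ops override table before loading,
verifying the hash of whatever it fetches:

    import hashlib

    # toy content-addressed module store: hash -> deep-frozen source
    STORE = {}

    def module_hash(source):
        return hashlib.sha256(source).hexdigest()

    def publish(source):
        h = module_hash(source)
        STORE[h] = source
        return h

    # developer-side: petnames are a local convenience only; a
    # "compile" step replaces them with hashes before execution
    v1 = publish(b"def greet():\n    return 'v1'\n")
    v2 = publish(b"def greet():\n    return 'v2 (security fix)'\n")
    PETNAMES = {"greeter": v1}

    # ops-side override table, e.g. populated from a signed
    # "v2 is a good replacement for v1" edge an admin chose to accept:
    # "I know you wanted hash v1, but accept hash v2 instead"
    OVERRIDES = {v1: v2}

    def load(wanted_hash):
        actual = OVERRIDES.get(wanted_hash, wanted_hash)
        source = STORE[actual]
        # verify before executing anything from the store
        assert module_hash(source) == actual
        namespace = {}
        exec(source, namespace)
        return namespace

    mod = load(PETNAMES["greeter"])
    print(mod["greet"]())   # 'v2 (security fix)', via the override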

Since that discussion I've realized an issue with the status quo for
dependency specification.  Suppose package C depends on A version 1
and B version 1, so it declares its dependencies something like this:

A == 1.0
B == 1.0

Later, a new release of A comes out, and the C devs carefully ensure
that C works with this new version, so they update their dependency
specification (suppose they are so careful that they don't use > or <,
but spell out each tested release exactly):

A == 1.0 or A == 2.0
B == 1.0

Later a new version of B is released, the C devs repeat their rigorous
testing and validation, and update the specification of dependencies:

A == 1.0 or A == 2.0
B == 1.0 or B == 2.0

I believe this is a very common pattern across languages and packaging
systems.  However this specifies a cartesian product of dependencies.
The C devs only ever tested { (A.1, B.1), (A.2, B.1), (A.2, B.2) } and
they never tested { (A.1, B.2) }.
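
A quick sketch of the gap (toy version strings, just to illustrate):
the declared ranges admit the full cartesian product, while the tested
set is one pair smaller:

    from itertools import product

    # what "(A == 1.0 or A == 2.0) and (B == 1.0 or B == 2.0)" admits
    admitted = set(product(["A.1", "A.2"], ["B.1", "B.2"]))

    # what the C devs actually tested, release by release
    tested = {("A.1", "B.1"), ("A.2", "B.1"), ("A.2", "B.2")}

    print(admitted - tested)   # {('A.1', 'B.2')}: admitted, never tested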

So a more "cautious" dependency expression system would allow
restrictions which aren't cartesian products:

[ (A == 1.0 or A == 2.0) and (B == 1.0) ] or [ A == 2.0 and B == 2.0 ]
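
As a sketch of how a resolver might consume such a non-cartesian
restriction (the names here are invented): the constraint is a set of
explicitly tested pairs rather than independent per-package ranges:

    # hypothetical resolver over pair-wise constraints
    TESTED = {("1.0", "1.0"), ("2.0", "1.0"), ("2.0", "2.0")}  # (A, B)

    def pick(available_a, available_b):
        ok = [(a, b) for a in available_a for b in available_b
              if (a, b) in TESTED]
        return max(ok) if ok else None  # prefer the newest tested pair

    print(pick(["1.0", "2.0"], ["1.0", "2.0"]))   # ('2.0', '2.0')
    print(pick(["1.0"], ["2.0"]))                 # None: untested pair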

I don't know of a clean way to do this for python package dependencies
that could be applied to tahoe-lafs.  We could pin a very exact set of
dependency versions, at the cost of introducing more brittleness.

The status quo (for most packaging systems) is probably to rely on the
assumption that there are no bugs unique to (A.1, B.2), or, even more,
to rely on the package announcement/distribution system (i.e.
pypi.python.org) to never present (A.1, B.2) at any time, so that the
chance of users installing that configuration is low.

Notice that if package releases are individually signed/verified
cryptographically, an attacker who wants to produce an (A.1, B.2)
configuration might execute a rollback attack by preventing a victim
from seeing A.2 while allowing B.2 through...  A defense against this
might be for pypi to sign the manifest of all versions at once so that
no manifest ever says (A.1, B.2).
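
A sketch of that defense (the HMAC here is just a stand-in for
whatever signature scheme pypi would actually use): the index signs
one manifest covering every package's current version, so a client
that verifies the manifest can never be shown the stale mixture:

    import hashlib, hmac, json

    KEY = b"stand-in for the index's signing key"

    def sign(manifest):
        blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
        return hmac.new(KEY, blob, hashlib.sha256).hexdigest()

    def verify(manifest, tag):
        return hmac.compare_digest(sign(manifest), tag)

    # the index signs the whole version map at once...
    manifest = {"A": "2.0", "B": "2.0"}
    tag = sign(manifest)

    # ...so rolling back A alone invalidates the signature
    rolled_back = dict(manifest, A="1.0")
    print(verify(manifest, tag))      # True
    print(verify(rolled_back, tag))   # False: (A.1, B.2) never verifies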


> cheers,
>  -Brian


