[tahoe-dev] Bringing Tahoe ideas to HTTP

Nathan nejucomo at gmail.com
Thu Sep 3 04:06:33 UTC 2009


While Brian focused on browser-side improvements that could bring some
Tahoe-like features to the boring ol' web, I have been more interested
in extracting useful orthogonal features out of Tahoe-LAFS on the
other side: from the storage services upwards.

For starters, I imagine defining a simple Content-Hash-Keyed (CHK) blob
storage API to which one may put a blob and receive a CHK for that
blob, and from which one may request a blob by key.  The put operation
returns the blob's CHK, and to promote simple implementations I'm
interested in extremely simple CHK derivations, such as a plain hash
of the contents (without encryption).  (This lets clients do clever
bandwidth-saving tricks, for instance.)
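
As a sketch of how small this could be, here is a toy in-memory version
in Python (the HTTP mapping, names, and the choice of SHA-256 are all
just illustrative assumptions):

    import hashlib

    class BlobStore:
        """Toy content-hash-keyed blob store.

        Over HTTP this would be roughly: PUT /blobs -> returns the CHK,
        GET /blobs/<chk> -> returns the blob (hypothetical paths).
        """

        def __init__(self):
            self._blobs = {}

        def put(self, blob: bytes) -> str:
            chk = hashlib.sha256(blob).hexdigest()  # plain hash of contents, no encryption
            self._blobs[chk] = blob
            return chk

        def get(self, chk: str) -> bytes:
            return self._blobs[chk]

    store = BlobStore()
    key = store.put(b"hello")
    assert store.get(key) == b"hello"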

Now this service is incredibly simple to understand and implement.

Notice that this API is a subset of the WAPI.  So is it possible to keep
extracting other orthogonal features, such as CHK encryption for
immutable files, erasure coding, mutable files, and lease management,
in piecewise increments, where each feature set is a fairly
independent and easy-to-implement chunk?


Consider three different architectures:

A. Trivial blob_storage:

client <-> blob_storage

B. Immutable Blobs with Confidentiality+Integrity:

client <-> encryption_layer <-> blob_storage

C. Immutable Blobs with Confidentiality+Integrity and Erasure Coding:

client <-> encryption_layer <->
erasure_coding_layer_with_peer_selection <=> blob_storage


In these diagrams every component except for "client" has a simple
RESTful HTTP CHK API.  Also, I'm making important assumptions about
authentication and transport security on each link in order to
preserve the guarantees for the layers to the left of that link.


Case A is similar to S3's or Azure's blob storage APIs, which suggests
one justification for doing this kind of feature decomposition.

Case B is similar to Duplicity over S3.

Case C is similar to Tahoe with only Immutable files.


This architectural decomposition appeals to me for three reasons:

a. Disjoint Features:

There are perhaps many *different* use cases for the three architectures
above, many of which we haven't imagined yet.  Tahoe-LAFS packages
many orthogonal features together, which is handy and removes
configuration and usability (and therefore security) headaches that a
more open architecture might have.

However, the full Tahoe-LAFS feature set may be overkill for many use
cases.  Also, there may be alternative designs at each layer which
could be employed simultaneously.

For example, Case C is the only one I notice that explicitly requires
a peer-selection strategy, and therefore some kind of grid or
introducer component.


b. Similarity to Other Systems:

In this architecture it's easy to imagine replacing components with
existing systems.  For example, if we distinguish between blob_storage
implementations which enforce CHK consistency and those which do not
(placing the burden solely on clients), then using S3 or Azure Simple
Storage Service in a particular way may qualify as the latter type of
blob_storage.
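
For the latter type, a thin client-side wrapper could supply the
missing CHK consistency check itself.  A rough sketch (the raw_store
interface is hypothetical; any name-to-bytes store such as an S3
bucket would fit the shape):

    import hashlib

    class VerifyingClient:
        """Client-side CHK enforcement over a store that just maps names to bytes."""

        def __init__(self, raw_store):
            self.raw_store = raw_store  # anything with put(name, blob) and get(name)

        def put(self, blob: bytes) -> str:
            chk = hashlib.sha256(blob).hexdigest()
            self.raw_store.put(chk, blob)  # the store itself never checks the name
            return chk

        def get(self, chk: str) -> bytes:
            blob = self.raw_store.get(chk)
            if hashlib.sha256(blob).hexdigest() != chk:
                raise ValueError("blob does not match its CHK")
            return blob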

This would appeal to users of existing systems who want some
underlying guarantees such as SLAs or integration into existing cost
models.


c. Easy to implement:

This is a really important goal.  If I were stuck on a desert island
with nothing but a loin cloth and an internet-less laptop, I could not
implement Tahoe-LAFS as it stands today.  But I could implement some
of the disjoint components.

It's easy for implementors to start independent implementations on the
right side of the diagrams.  Even a mediocre developer could implement
a prototype of blob_storage in little time and understand how it
operates.



There are some drawbacks which come from this separation, but I
haven't done a thorough job of considering them.  Here's a quick brain
dump:

d. Efficiency.
e. HTML (and SSL?) suckage.
f. There may be abstraction-penetrating features that cannot be decomposed.
g. Consistency - Backwards compatibility and legacy problems increase
when the architecture becomes more disjoint.


Finally, in terms of implementation, it seems plausible that the
lowest layers of Tahoe-LAFS can start migrating to this more open
architecture piecewise.

For instance, maybe the next release will use a RESTful CHK blob
storage API with leasing.  Other systems could use the same nodes for
other purposes, but Tahoe-LAFS users should notice little usability
difference.


I'm interested in tahoe-dev's reaction to this proposal.  Thoughts?


Nathan



On Thu, Aug 27, 2009 at 2:02 PM, Brian Warner <warner at lothar.com> wrote:
>
> At lunch yesterday, Nathan mentioned that he is interested in seeing how
> Tahoe's ideas and techniques could trickle outwards and influence the
> design of other security systems. And I was complaining about how the
> Firefox upgrade process doesn't provide the integrity checks that I want
> (it turns out they rely upon the CA infrastructure and SSL alone, no
> end-to-end checking; the updates and releases are GPG-signed, but
> firefox doesn't check that, only humans might). And PyPI has this nice
> habit of appending "#md5=XYZ.." to the URLs of the release tarballs that
> they publish, which is (I think) automatically used by tools like
> easy_install to guard against corrupted downloads (and which I always
> use, as a human, to do the same). And Nathan mentioned a class of web
> attacks in which a page, loaded over SSL, imports something (JS, CSS,
> JPG) via a regular http: URL, and becomes vulnerable to third-parties
> who can take over the page by controlling what arrives over
> unauthenticated HTTP.
>
> So, setting aside the reliability-via-distributedness properties for a
> moment, what could we bring from Tahoe into regular HTTP and regular
> webservers that could improve the state of security on the web?
>
> == Integrity ==
>
> To start with integrity-checking, we could imagine a firefox plugin that
> validated a PyPI-style #md5= annotation on everything it loads. The rule
> would be that no action would be taken on the downloaded content until
> the hash was verified, and that a hash failure would be treated like a
> 404. Or maybe a slightly different error code, to indicate that the
> correct resource is unavailable and that it's a server-side problem, but
> it's because you got the wrong version of the document, rather than the
> document being missing altogether.
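
A sketch of that check in Python (the fragment parsing here is
improvised; only the #md5= convention itself comes from PyPI):

    import hashlib
    import urllib.parse
    import urllib.request

    def fetch_with_md5_check(url: str) -> bytes:
        """Fetch url, refusing to return the body unless it matches its #md5= fragment."""
        bare_url, fragment = urllib.parse.urldefrag(url)  # the fragment is never sent to the server
        params = dict(p.split("=", 1) for p in fragment.split("&") if "=" in p)
        body = urllib.request.urlopen(bare_url).read()
        expected = params.get("md5")
        if expected is not None and hashlib.md5(body).hexdigest() != expected:
            raise IOError("hash mismatch: wrong version of the document, treat like a 404")
        return body
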
>
> This would work just fine for a flat hash: the original file remains
> untouched, only the referencing URLs change to get the new hash
> annotation. Non-enhanced browsers are unaffected: the #-prefixed
> fragment identifier is never sent to the server, and the <a name=> tag
> is fairly rare these days (and would still mostly work). Container files
> (the HTML which references the hashed documents) could be updated to
> benefit at leisure. Automation (see below) could be used to update the
> URLs in the containers whenever the referenced objects were modified.
>
> To improve alacrity on larger files, Tahoe uses a Merkle tree over
> segments of the file. This tree has to be stored somewhere (Tahoe stores
> it along with the shares, but it would be more convenient for a web site
> to not modify the source files). We could use an annotation like
> "#hashtree=ROOTXYZ;http://otherplace" to reference an external hash tree
> (with root hash XYZ). The plugin would start pulling from the source
> file and the hash tree at the same time, and not deliver any source data
> until it had been validated. The hashtree object would need to start
> with the segment size and filesize, so the tree could be computed
> properly. For very large files, you could read those parameters and then
> pull down (via a Range: header) just the parts of the Merkle tree that
> were necessary. In this case, the automation would need to create the
> hash tree file and put it in a known place each time the source file
> changes, and then update the references.
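
A rough sketch of computing such a root over fixed-size segments (the
segment size, hash, and padding rule are arbitrary choices here, not
Tahoe's actual parameters):

    import hashlib

    SEGMENT_SIZE = 128 * 1024  # the real hashtree file would record this, plus the filesize

    def merkle_root(data: bytes) -> bytes:
        """Root of a binary Merkle tree over fixed-size segments of data."""
        h = lambda b: hashlib.sha256(b).digest()
        layer = [h(data[i:i + SEGMENT_SIZE])
                 for i in range(0, max(len(data), 1), SEGMENT_SIZE)]
        while len(layer) > 1:
            if len(layer) % 2:  # pad odd layers by duplicating the last node
                layer.append(layer[-1])
            layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
        return layer[0]

The plugin would compare this root against ROOTXYZ, and would only need
the interior hashes covering the byte ranges it actually fetched.
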
>
> (note that "ROOTXYZ" provides the "identification" properties of this
> annotation, and "http://otherplace" provides the "location" properties,
> where identification means the ability to recognize the correct document
> if someone gives it to you, and location means the ability to retrieve a
> possibly-correct document. URIs provide identification, URLs are
> supposed to provide both.)
>
> We could compress this by establishing an (overridable) convention that
> http://example.com/foo.mp3 always has a hashtree at
> http://example.com/foo.mp3.hashtree, resulting in a URL that looked like
> "http://example.com/foo.mp3#hashtree=ROOTXYZ". If you needed to store it
> elsewhere, you could use "#hashtree=ROOTXYZ;WHERE", and define WHERE to
> be a relative URL (with a default value of NAME.hashtree).
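
Resolving that annotation could be as simple as the following
(hypothetical helper, with the NAME.hashtree default described above):

    import urllib.parse

    def hashtree_location(url: str):
        """Return (root_hash, absolute URL of the hashtree) for a #hashtree= annotation."""
        base, fragment = urllib.parse.urldefrag(url)  # fragment e.g. "hashtree=ROOTXYZ;WHERE"
        value = fragment.split("=", 1)[1]
        root, _, where = value.partition(";")
        if not where:
            where = base.rsplit("/", 1)[-1] + ".hashtree"  # default: NAME.hashtree next to the file
        return root, urllib.parse.urljoin(base, where)
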
>
> == Mutable Integrity ==
>
> Zooko and I have both run HTML presentations out of a Tahoe grid (which
> makes for a great demo), and the first thing you learn there is that
> immutability, while a great property in some cases, is a hassle for
> authoring. You need mutability somewhere, and the more places you have
> it, the fewer URLs you have to update every time you change something.
> In technical terms, you frequently want to cut down the diameter of the
> immutable domains of the object DAG, by splitting those domains with
> mutable boundary nodes. In practical terms, it means you might want to
> publish *everything* via a mutable file. At the very least, if your web
> site has any internal cycles in it, you'll need a mutable node to break
> the cycle.
>
> Again, this requires data beyond the contents of the source file. We
> could use a "#sigkey=XYZ" annotation with a base62'ed ECDSA pubkey (this
> would provide the "identification" property of the constant pubkey), but
> we'd still need to know where to get the actual signature (the
> "location" property of the variable signature). We could do
> "#sigkey=XYZ;sigurl=http://otherplace". Or we could establish a
> convention of keeping the signature files next to the source files with
> "#sigkey=XYZ;sigsuffix=.sig" (and then http://example.com/main.css would
> have its signature stored in http://example.com/main.css.sig). Or,
> compress the convention further and have "sigkey=" imply
> "sigsuffix=.sig" unless overridden.
>
> This would involve two GETs, but they'd be done in parallel, and the
> original files would remain untouched (thus unaware browsers would be
> unaffected, obliviously content in their insecurity). The immutable
> "#hashtree=" would also involve two parallel GETs, but presumably it'd
> only be used for large files, in which case the overhead would be less
> noticeable. Whereas the mutable "#sigkey=" would be used for even small
> files, so you might notice the overhead more.
>
> The .sig file would probably contain a copy of the pubkey too, for local
> verification purposes. If we used a signature scheme that didn't give us
> short-enough pubkeys, the .sig file would contain the whole pubkey, and
> the #sigkey=XYZ suffix would contain its hash.
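
A sketch of the verification side, using Ed25519 from the Python
cryptography package in place of the ECDSA mentioned above, and with a
made-up .sig layout (pubkey followed by signature) purely for
illustration; the #sigkey= value is taken to be the pubkey hash:

    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def verify_mutable(body: bytes, sig_file: bytes, sigkey: str) -> bytes:
        """Check body against its .sig file; sigkey is the #sigkey= value from the URL."""
        pubkey_bytes, signature = sig_file[:32], sig_file[32:]  # assumed layout: 32-byte pubkey, then sig
        if hashlib.sha256(pubkey_bytes).hexdigest() != sigkey:
            raise ValueError("pubkey in .sig does not match the #sigkey= annotation")
        try:
            Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(signature, body)
        except InvalidSignature:
            raise ValueError("bad signature: treat like a 404")
        return body
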
>
> == Encryption ==
>
> Now, how could we provide fine-grained confidentiality? We all know how
> broken the SSL+CA model is. Tahoe uses per-object encryption keys that
> are tightly bound to the object identifiers, providing obj-cap
> properties (like fine-grained delegation) and also honoring the
> end-to-end argument.
>
> Obviously, this step requires abandoning the unmodified browser. Goodbye
> unmodified browser! Now, the plugin-enhanced browsers that are left can
> recognize a new URL scheme. Let's call it "x-yzzy:" for now (I don't
> want to use "tahoe:" for this purpose, since I still want that for
> *distributed* secure files). These URLs will look like
> "x-yzzy://example.com/READKEY.UEBHASH", and behave just like Tahoe
> immutable readcaps for 1-of-1 encoded files except they reference the
> single host where you can get the sole share (instead of permuting an
> out-of-band serverlist to find a set of likely places for k shares). The
> READKEY would be hashed to form a storage-index, then the plugin would
> fetch http://example.com/STORAGEINDEX (base64-encoded), which would
> contain an encrypted+hashed version of the plaintext. The hash
> information would include both a flat hash and a merkle tree, covered by
> a UEB just like in tahoe (except we could drop the block hash tree since
> k=1).
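
A sketch of the client side of that, with AES-128-CTR from the Python
cryptography package (readkey assumed to be 16 bytes), a made-up
storage-index derivation, and the UEB/hashtree verification reduced to
a comment:

    import base64
    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def resolve_immutable(host: str, readkey: bytes, fetch) -> bytes:
        """Resolve an x-yzzy://host/READKEY.UEBHASH cap; fetch(url) -> bytes is caller-supplied."""
        storage_index = hashlib.sha256(b"storage-index:" + readkey).digest()[:16]  # made-up derivation
        name = base64.urlsafe_b64encode(storage_index).decode().rstrip("=")
        ciphertext = fetch("http://%s/%s" % (host, name))
        # a real client would verify the flat hash and merkle tree against the
        # UEB hash here, before releasing any plaintext to the caller
        decryptor = Cipher(algorithms.AES(readkey), modes.CTR(b"\0" * 16)).decryptor()
        return decryptor.update(ciphertext) + decryptor.finalize()
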
>
> For mutable files, the URL would be "x-yzzy://example.com/MUTREADKEY",
> which would be even shorter (2*kappa instead of (1+2)*kappa, if I'm
> remembering the necessary length of the hash correctly). Again,
> MUTREADKEY is hashed to form a storage-index, the corresponding
> ciphertext+hashes+signature file is fetched, the hashes checked, the
> signature checked, the data decrypted, and delivered to the caller.
>
> Web servers would be completely unaffected: they'd just have directories
> full of base64-encoded (or base62, or a modified base64 without "/", or
> whatever) filenames, which they serve to anyone who cares. All GETs
> would use unencrypted http, since this protocol would provide both
> integrity and confidentiality.
>
> Oh, and the rule would be that the storage-index would be treated as a
> URL relative to the http equivalent of the original x-yzzy URL. So
> "x-yzzy://example.com/subdir/READKEY.UEBHASH" would get an encrypted
> blob from "http://example.com/subdir/STORAGEINDEX".
>
> == Tools ==
>
> You'd start with a hashing tool: given a file, emit the "#hash=XYZ"
> suffix that should be tacked on to the URL. Or, given a URL prefix and
> a webroot-relative filename, emit the whole URL.
>
> Then you'd move on to the merkle tree generation tool. Given FILENAME,
> it writes the hash tree data to FILENAME.hashtree, and emits the
> "#hashtree=XYZ" suffix that you need to attach to the URL.
>
> The mutable-file tool would maintain an out-of-webroot file mapping
> pubkey to privkey. It would create a new keypair when run on a file that
> did not already have a .sig file, or would extract the old pubkey from
> an existing .sig file and look up the corresponding signing key. It
> would emit the #sigkey=XYZ suffix, and update or create the .sig file
> (next to the original data file) with the new signature.
>
> The encryption+immutable tool would take a file (from your source
> directory, which of course would *not* be under the webroot), produce
> the encrypted+hashed tahoe-like single-share output data, store it in
> the webroot under the storage-index name, and emit the URL.
>
> The encryption+mutable tool would do the same, taking the existing key
> from an adjoining .key file (or creating a new one), putting the
> signed+hashed+encrypted data in the webroot, and emitting the URL.
>
> == Automation ==
>
> Now, what's a good way to update all the container files? I.e., when you
> change your CSS and it gets a new hash, how should you update the .html
> file that references it? I've been using Git a lot recently, and it gave
> me an idea:
>
>  * store your website in Git or Mercurial (you *do* manage your website
>   in a revision control system, right? and the system you picked *does*
>   give you cryptographically-strong file-version identifiers, right?)
>
>  * use regular relative URLs in the .html files that you check in; web
>   authors remain unaware of the integrity-checking suffixes that get
>   added later
>
>  * now build a tool that rewrites the HTML (and other containers, JS and
>   perhaps CSS) to replace the relative URLs with URL#hash=XYZ . The
>   tool runs at checkout time, when you deploy a new revision to the
>   webserver, or takes a git checkout (with all repository metadata) as
>   input and produces the webroot directories as output.
>
>  * The tool will build a table that says "bar.css has hash=XYZ" for
>   everything that gets checked out.
>
>  * take advantage of git's hash-of-data content-tracking properties to
>   cache the table that maps object to #hash=XYZ values: instead of "the
>   current version of bar.css has hash=XYZ", remember "version ABC of
>   bar.css will always have hash=XYZ".
>
>  * build a table that says "version ABC of foo.html references bar.css
>   and baz.js", to capture the object graph. Invert the table ("bar.css
>   is referenced by version ABC of foo.html, among others"). Now you can
>   quickly tell what files need rewriting when bar.css is modified. New
>   versions of foo.html get rescanned, added to the who-references-whom
>   table, then processed (hashed) and added to the whats-your-hash
>   table, then anyone who references it gets updated.
>
>  * keep careful track of containers (objects which reference other
>   objects). If bar.css imports booze.css, then while the original
>   contents of bar.css might not change, the annotated version (which
>   includes "booze.css#hash=XYZ") will change whenever booze.css
>   changes. The tables must reflect this, so that the updating scheme
>   will catch everything.
>
>  * the last step should be a sanity check, walking through all the
>   output files, and comparing the #hash=XYZ values therein with the
>   actual hashes of the other output files.
>
>  * the generated tables can be used to alert you to immutable-reference
>   cycles, which are a no-no, and require mutability somewhere to break
>   the circle and turn the graph back into a strict DAG.
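
A bare-bones sketch of the rewriting step described in this list
(regex-based and oblivious to most HTML realities, just to show the
shape of the tool):

    import hashlib
    import os
    import re

    def annotate_html(html_path: str) -> str:
        """Rewrite simple relative href/src URLs in one HTML file to carry #hash= suffixes."""
        def add_hash(match):
            attr, target = match.group(1), match.group(2)
            referenced = os.path.normpath(os.path.join(os.path.dirname(html_path), target))
            with open(referenced, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            return '%s="%s#hash=%s"' % (attr, target, digest)

        with open(html_path, encoding="utf-8") as f:
            html = f.read()
        # only touch relative references with no scheme, query, or existing fragment
        return re.sub(r'\b(href|src)="([^":#?]+)"', add_hash, html)

The caching and who-references-whom tables would sit around this core,
so that only the affected containers get rewritten on each checkout.
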
>
> Then, when you introduce mutability, you somehow mark the filenames that
> you want to be delivered as mutable (breaking cycles and reducing
> reference-updating effort, in exchange for possibly slowing down client
> fetch times). Then this rewriting tool will treat those files
> differently at checkout, creating (or updating) mutable objects for
> them. Other files which reference the mutable ones don't need to be
> updated when they change.
>
> When you introduce encryption, the same tool is used, except it dumps
> encrypted+hashed+(sometimes-)signed storage-index-named files into the
> output directory, instead of preserving the original filenames. The
> sanity-check would need to be given the readcaps (instead of working on
> the ciphertext, obviously), but would proceed the same way.
>
> The entire process could be automated to run each time you pushed a
> change to the production branch. Authors would be unaware of the process
> (except they'd get fewer complaints about http-used-in-https
> vulnerabilities). Web servers would be unaware of the process (they're
> just serving up weirdly-named files). End users (well, at least those
> who'd installed the plugin) would be mostly unaware of the process
> (they'd just see weird URLs in their status bar, but they're starting to
> get used to that anyways). If you stick with integrity (and not
> encryption), then end users with normal browsers are mostly unaware
> (they see the #hash=XYZ suffixes, if their status bar is wide enough).
>
> I've no idea how hard it would be to write this sort of plugin. But I'm
> pretty sure it's feasible, as would be the site-building tools. If
> firefox had this built-in, and web authors used it, what sorts of
> vulnerabilities would go away? What sorts of new applications could we
> build that would take advantage of this kind of security?
>
> thoughts?
>  -Brian
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
>


