[tahoe-dev] Bringing Tahoe ideas to HTTP

Brian Warner warner at lothar.com
Mon Oct 5 05:47:56 UTC 2009


(If you missed the start of this thread, look here:
http://allmydata.org/pipermail/tahoe-dev/2009-August/002724.html and
http://allmydata.org/pipermail/tahoe-dev/2009-September/002804.html)

After a chat with Tyler Close, a few more ideas came to mind. He pointed
out that a stubborn problem with web applications these days is that the
HTTP browser caches are not doing as much good as developers expect.
Despite frequently-used things like "jquery.js" being cacheable, a
substantial portion of clients are showing up to the server
empty-handed, and must re-download the library on pretty much every
visit. One theory is that the browser caches are too small, and pretty
much everything is marked as being cacheable, so it's just simple
cache-thrashing. There's no way for the server to say "I'll be telling
you to load jquery.js a lot, so prioritize it above everything else, and
keep it in cache as long as you can".

And, despite hundreds of sites all using the exact same copy of
jquery.js, there's no way to share those cached copies, increasing the
cache pressure even more. Google is encouraging all web developers to
pull jquery.js from a google.com server, to reduce the pressure, but of
course that puts you at their mercy from a security point of view: they
(plus everyone else that can meddle with your bits on the wire) can
inject arbitrary code into that copy of jquery.js, and compromise
millions of secure pages.

So the first idea is that the earlier "#hash=XYZ" URL annotation could
be considered as a performance-improving feature. Basically the
browser's cache would have an additional index using the "XYZ" secure
hash (and *not* the hostname or full URL) as the key. Any fetch that
occurs with this same XYZ annotation could be served from the local
cache, without touching the network. As long as the previously-described
rules were followed (downloads of a #hash=XYZ URL are validated against
the hash, and rejected on mismatch), then the cache could only be
populated with validated files, and this would be a safe way to share
common files between sites.
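
To make that concrete, here's a rough sketch of what such a
hash-indexed cache layer might look like, written as page-level
JavaScript purely for illustration: the names (hashCache,
fetchWithHash), the hex-encoded SHA-256, and the use of the Web Crypto
API are all my assumptions; the real logic would live inside the
browser's HTTP cache.

    // Sketch only: a persistent map keyed by the secure hash. The
    // real thing would live inside the browser's cache, not in page
    // script, and the hash algorithm/encoding is assumed here.
    const hashCache = new Map();   // hash -> validated response bytes

    async function fetchWithHash(url) {
      const m = new URL(url).hash.match(/hash=([0-9a-f]+)/);
      const expected = m && m[1];
      if (expected && hashCache.has(expected)) {
        return hashCache.get(expected);   // served locally, no network
      }
      const bytes = await (await fetch(url)).arrayBuffer();
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      const actual = Array.from(new Uint8Array(digest))
        .map(b => b.toString(16).padStart(2, "0")).join("");
      if (expected && actual !== expected) {
        throw new Error("hash mismatch, refusing " + url);
      }
      if (expected) hashCache.set(expected, bytes);  // only validated
                                                     // bytes get cached
      return bytes;
    }

The important property is the last step: nothing enters the
hash-indexed cache unless it has been validated against the hash in
the URL, so a second site asking for the same #hash=XYZ gets bytes
that are known to match.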

The second idea involves some of the capability-security work,
specifically Mark Miller's "Caja" group, which has developed a secure
subset of JavaScript. Part of the capability world's effort is to talk
about different properties that a given object has, as determined by
mechanical auditing of the code that implements that object. One of
these properties is called "DeepFrozen", which basically means that the
object has no mutable state and has no access to mutable state. If Alice
and Bob (who are both bits of code, in this example) share access to a
DeepFrozen object, there's no way for one of them to affect the other
through that object: they might as well have two identical independent
copies of the same object. The common "memoization" technique requires
the function being optimized to be DeepFrozen, to make sure that it
will always produce the same output for any given input.

(note that this doesn't mean that the object can't create, say, a
mutable array and manipulate it while the function runs... it just
means that it can't retain that array from one invocation to the next)
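
In code terms, a tiny sketch of the distinction (my own illustration,
not Caja's actual audit) might look like this:

    // Illustration only. This first function may create and mutate a
    // local array while it runs, but it retains nothing between
    // invocations: the same input always produces the same output.
    function digitsOf(n) {
      const digits = [];             // mutable, but purely local
      do { digits.unshift(n % 10); n = Math.floor(n / 10); } while (n > 0);
      return digits.join("");
    }

    // This one is *not* DeepFrozen: it retains mutable state across
    // calls, so two callers sharing it affect each other's results.
    let total = 0, samples = 0;
    function runningAverage(x) {
      total += x; samples += 1;
      return total / samples;
    }

    // Memoization is only sound for the first kind of function.
    function memoize(fn) {
      const cache = new Map();
      return n => {
        if (!cache.has(n)) cache.set(n, fn(n));
        return cache.get(n);
      };
    }
    const fastDigits = memoize(digitsOf);   // safe to memoize
    // memoize(runningAverage) would silently return stale answers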

So the second idea is that, if your #hash=XYZ-validated jquery.js
library can be proven to be DeepFrozen (say, by passing it through the
Caja verifier with a flag that says "only accept DeepFrozen classes" or
something), then not only can you cache the JavaScript source code, but
you can also cache the parse tree, saving you the time and memory needed
to re-parse and evaluate the same source code on every single page load.
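
Continuing the earlier sketch, the parse-tree cache might look
something like the following; note that verifyDeepFrozen() is a
made-up stand-in for a Caja-style auditor (nothing like it ships
today), and new Function() stands in for whatever internal compile
step the browser would really use:

    // Sketch: a second cache keyed by the same secure hash, holding
    // compiled libraries rather than raw bytes.
    const parsedCache = new Map();        // hash -> compiled library

    async function loadLibrary(url) {
      const m = new URL(url).hash.match(/hash=([0-9a-f]+)/);
      const hash = m && m[1];
      if (hash && parsedCache.has(hash)) {
        return parsedCache.get(hash);     // skip re-parse and re-evaluate
      }
      const bytes = await fetchWithHash(url);   // hash-validated, per
                                                // the earlier sketch
      const source = new TextDecoder().decode(bytes);
      const compiled = new Function(source);    // parse/compile once
      if (hash && verifyDeepFrozen(source)) {   // hypothetical auditor
        parsedCache.set(hash, compiled);        // safe to share across pages
      }
      return compiled;
    }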

(incidentally, it is quite likely that jquery.js would pass a DeepFrozen
auditor, or could be made to do so fairly easily: anything that's
written in a functional style will avoid using much mutable state)

This requires both the DeepFrozen property (which makes it safe to share
the parsed data structure) and the #hash=XYZ validation (which ensures
that the data structure was really generated from the right JS source
code).

I know that one of the Caja goals is to eventually get the verifier
functionality into the browser, since that's where it can do the most
good. If that happened, then the performance improvements to be had by
writing verifiable code and using strong URL references could be used as
a carrot to draw developers into using these new techniques.

thoughts?

cheers,
 -Brian


