[tahoe-dev] Choice of tree-hash

CodesInChaos codesinchaos at gmail.com
Fri Sep 21 11:02:02 UTC 2012


> The storage overhead is always linear with the number of leaves, so
> small leaves result in a higher overhead.

You don't need to store the tree down to the leaves. You can truncate
it at your chosen segment size. I'm not advocating 1 KiB segments,
only small, constant-sized leaves. This makes the resulting hash
independent of the chosen segment size, which is an important
property of a universal tree-hash.
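
To illustrate the truncation argument, here is a minimal sketch. It assumes SHA-256, 1 KiB leaves, and a file whose length is a power-of-two multiple of the leaf size; it omits padding and the leaf/node domain separation a real tree-hash would want. All function names are made up for the example.

```python
import hashlib

LEAF_SIZE = 1024  # assumed constant leaf size (1 KiB)


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hashes(data: bytes) -> list[bytes]:
    """Hash every fixed-size leaf of the file."""
    return [h(data[i:i + LEAF_SIZE]) for i in range(0, len(data), LEAF_SIZE)]


def reduce_level(nodes: list[bytes]) -> list[bytes]:
    """Combine adjacent pairs of nodes into their parents."""
    return [h(nodes[i] + nodes[i + 1]) for i in range(0, len(nodes), 2)]


def root_of(nodes: list[bytes]) -> bytes:
    """Reduce any complete layer of the tree to the root."""
    while len(nodes) > 1:
        nodes = reduce_level(nodes)
    return nodes[0]


def truncated_layer(data: bytes, segment_size: int) -> list[bytes]:
    """Return the layer in which each node covers exactly one segment.

    This is the lowest layer that would need to be stored; everything
    below it can be recomputed from the segment itself.
    """
    nodes = leaf_hashes(data)
    levels_up = (segment_size // LEAF_SIZE).bit_length() - 1
    for _ in range(levels_up):
        nodes = reduce_level(nodes)
    return nodes


data = bytes(range(256)) * 256          # 64 KiB of sample data
full_root = root_of(leaf_hashes(data))
stored_4k = truncated_layer(data, 4 * 1024)
stored_16k = truncated_layer(data, 16 * 1024)
```

With 4 KiB segments you store 16 nodes at the bottom layer, with 16 KiB segments only 4, and both reduce to the same root as the full 64-leaf tree.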

> Tahoe doesn't prepend anything to the file before hashing. I'd be
> worried about people accidentally corrupting files by not adding or
> removing the right data at the right time. Similarly, plaintext length
> is transmitted elsewhere (in an authenticated location: either included
> in or covered by the filecap).

I don't suggest prepending those to the file. My suggestion is that
once you have hashed your tree down to its root, you hash that root
again, together with an artificial second leaf consisting of the first
24 bytes of the file and the 8-byte file length.
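
A rough sketch of that final step follows. The hash function (SHA-256), the big-endian length encoding, the zero padding for files shorter than 24 bytes, and the order of the two inputs are all assumptions made for the example, not part of the proposal.

```python
import hashlib
import struct


def artificial_leaf(data: bytes) -> bytes:
    """Build the 32-byte extra leaf: 24 bytes of file prefix + 8-byte length."""
    prefix = data[:24].ljust(24, b"\x00")   # zero padding is an assumption
    length = struct.pack(">Q", len(data))   # big-endian is an assumption
    return prefix + length


def final_hash(tree_root: bytes, data: bytes) -> bytes:
    """Hash the tree root once more, together with the artificial leaf."""
    return hashlib.sha256(tree_root + artificial_leaf(data)).digest()


sample = b"%PDF-1.4 example file contents, longer than 24 bytes"
sample_root = hashlib.sha256(b"stand-in for the real tree root").digest()
leaf = artificial_leaf(sample)
digest = final_hash(sample_root, sample)
```

Because the leaf is exactly 32 bytes, it has the same size as a node hash and can sit in the tree next to the root without any special casing.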

This additional information would be stored in the same place you
store the hashtree. A hashtree of depth n has 2^n-1 nodes, and thus a
size of 32*(2^n-1) bytes. When you store the additional 32 bytes I
advocate next to the tree, the total comes to 32*2^n bytes, which is a
power of two. It also means that each "layer" of the tree will start
aligned to a power of two equal to the size of that layer. This is
convenient when you store an encrypted hashtree (for example of the
plaintext), since now each layer of the hashtree can be addressed in a
hashtree of the ciphertext.
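
The size and alignment arithmetic can be checked with a small sketch. It assumes a heap-style layout (extra 32 bytes in slot 0, root in slot 1, then each layer in turn); the layout is only one possible reading and is not spelled out above.

```python
NODE_SIZE = 32  # bytes per node hash


def layer_offsets(depth: int) -> list[tuple[int, int]]:
    """Return (byte offset, byte size) for each layer, root layer first.

    Slot 0 is reserved for the extra 32 bytes, so layer k (with 2**k
    nodes) starts at node index 2**k.
    """
    return [(NODE_SIZE * 2**k, NODE_SIZE * 2**k) for k in range(depth)]


def total_size(depth: int) -> int:
    """Extra 32 bytes plus all 2**depth - 1 nodes."""
    return NODE_SIZE + sum(size for _, size in layer_offsets(depth))


layers = layer_offsets(4)
for offset, size in layers:
    print(f"layer at byte {offset:4d}, size {size:4d}")
```

In this layout every layer starts at an offset that is a multiple of its own size, and the whole structure occupies 32*2^n bytes.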

In Tahoe I'd advocate including the top 64 bytes of the hash tree in
the UEB (encrypted with the read cap, obviously). That way you can get
both file type (in many cases) and size without downloading the actual
file.
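
As a sketch of what a reader could do with those 64 bytes once decrypted: I'm reading "top 64 bytes" as the artificial leaf followed by the tree root, with a big-endian length field. Both the ordering and the encoding are assumptions, and the magic-number table is just a tiny illustrative sample.

```python
import struct

# A few well-known file signatures, for illustration only.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}


def inspect_top(top64: bytes):
    """Split the 64 bytes into (guessed type, file length, tree root)."""
    if len(top64) != 64:
        raise ValueError("expected exactly 64 bytes")
    leaf, root = top64[:32], top64[32:]      # assumed order: leaf, then root
    prefix = leaf[:24]
    (length,) = struct.unpack(">Q", leaf[24:])
    kind = next(
        (name for magic, name in MAGIC.items() if prefix.startswith(magic)),
        None,                                 # unknown type
    )
    return kind, length, root


example_leaf = b"\x89PNG\r\n\x1a\n".ljust(24, b"\x00") + struct.pack(">Q", 12345)
example_root = bytes(32)                      # placeholder root
info = inspect_top(example_leaf + example_root)
```

The type guess only works for formats with a recognizable signature in the first 24 bytes, hence "in many cases".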

> This sounds great! I was actually trying to avoid any kind of hash of the plaintext in my system, but that said, an encrypted hash used to detect bugs in the encryption protocol would definitely make sense.

This hash doesn't only detect bugs; it also identifies the file. So
you can easily see that a file stored on Tahoe is the same one as a
file stored on Cryptosphere (assuming knowledge of the read caps,
obviously).


