[tahoe-dev] UEB hash size

Shawn Willden shawn-tahoe at willden.org
Mon Jul 13 03:51:19 UTC 2009


On Sunday 12 July 2009 08:50:43 pm Zooko Wilcox-O'Hearn wrote:
> By the way, I'm sitting on a good idea that I haven't finished
> writing up yet for how to combine the encryption key and the
> integrity-checking hash together so that you have only one value
> (perhaps of size 256 bits) instead of two values -- one for the key
> and one for the hash.

I can see several ways to do that, and it makes a lot of sense, particularly 
since in the case of a content-hash-key, the key is another integrity 
verifier, as well as a decryption key.  Not so in the case of a random key, 
of course.  Given a content-hash-key, I think the combined key and UEB hash 
could be even shorter than 256 bits.  192 bits would give you strong 
integrity assurance (since all 192 bits are used for that), as well as 
strong encryption (using, say, the first 128 bits as the AES key).
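
Something along these lines, perhaps (just a sketch; the tag string, the
tag-separation scheme, and the 128/192 split are my own choices, not
anything Tahoe does today):

```python
import hashlib

def tagged_hash_sha256d(tag: bytes, data: bytes, truncate_to: int) -> bytes:
    # Double SHA-256 over a tag-prefixed message, truncated to `truncate_to`
    # bytes.  Illustrative only; Tahoe's exact tagging differs.
    inner = hashlib.sha256(tag + b":" + data).digest()
    return hashlib.sha256(inner).digest()[:truncate_to]

def combined_cap(plaintext: bytes) -> bytes:
    # A single 24-byte (192-bit) content-derived value: all of it serves
    # as the integrity verifier, and its first 16 bytes double as the key.
    return tagged_hash_sha256d(b"chk-combined-cap-v1", plaintext, 24)

cap = combined_cap(b"example file contents")
aes_key = cap[:16]   # 128-bit encryption key
verifier = cap       # all 192 bits check integrity
```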

> Perhaps that would solve most of your 
> performance issues?  As I mentioned in my previous mail, I'd like to
> understand more about what the performance implications are in
> GridBackup.

Me too.  So far, I just know that read cap index uploading is significantly 
slowing backups.  Most of this is because the burst trie structure consists 
of a large number of files, and because they're mutable, which makes them 
slower.  I chose to make them mutable because they do mutate, but perhaps it 
would help to make them immutable anyway.  If I could reduce the size of an 
index entry by 30%, I could reduce the need to update trie nodes by the same 
amount.

For the moment I've switched to uploading the indices as a set of sorted flat 
files, and I just have to binary search each of them in turn when looking for 
a specific cap.  That makes uploading fast and searching slow (somewhat; the 
files are cached locally, so the search is performed locally).  I think after 
I've accumulated a few hundred thousand entries in the flat files, I'll then 
push them all into the trie.  Bulk updates of the trie are fairly quick on a 
per-entry basis.

In general, I view the read cap index as overhead and I'd like to minimize it.  
Currently, the CHK URI portion of each record consumes 60 of 78 total bytes, 
of which only the 16-byte AES key is really essential for retrieval.

> > What is the bare minimum data needed to retrieve, reassemble and
> > decrypt an immutable file?  Just the AES read key?
>
> That, and some way to find the shares, which we currently call the
> "storage index".

But the SID is derived from the read key, so that's implied.
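
That is, something like the following (the tag string is illustrative, not
Tahoe's actual one, though the shape -- a tagged SHA-256d hash truncated to
16 bytes -- is the general idea):

```python
import hashlib

def storage_index_from_readkey(readkey: bytes) -> bytes:
    # Derive a 16-byte storage index from the AES read key.  The tag is
    # made up for illustration; Tahoe's real derivation differs in detail.
    tag = b"storage_index_from_readkey_v1:"
    inner = hashlib.sha256(tag + readkey).digest()
    return hashlib.sha256(inner).digest()[:16]

sid = storage_index_from_readkey(b"\x00" * 16)
```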

> That would omit not only the integrity check on the 
> ciphertext (to guarantee that the immutable cap you started with
> could match only one file).

Well, if you regenerated the read key from the plaintext and compared with the 
original read key, you'd know that it matches with very high probability.  
The exact probability depends on the number of files in the system, but even 
with trillions of files, the odds of a false match are on the order of 
winning Powerball twice in a row.
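
Back-of-envelope, assuming an ideal 128-bit content-derived read key and
about a trillion files (my round numbers, not measured ones):

```python
# Chance that some other file in the system shares a given read key.
N = 2 ** 40                 # roughly a trillion files
k = 128                     # bits in the AES read key
p_false_match = N / 2 ** k  # about 2**-88, i.e. roughly 3e-27
# For scale: winning Powerball (~1 in 3e8) twice in a row is ~1e-17,
# so a false match is far rarer still.
```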

> but also the integrity check on the shares 
> (to identify which servers are responsible for serving up corrupted
> shares, in the case that the resulting file was corrupted).

If the shares carry different UEBs, comparing them tells you which shares 
are wrong.  In the case of any sort of random corruption, you'd be able to 
tell that anyway.  If the UEBs are all the same, then the UEB itself 
contains the information to identify which share is corrupted.

Assuming the read key is the primary integrity-verification mechanism (and I 
know that's not the current assumption, but it could be, and it works), the 
birthday paradox is irrelevant with respect to the UEB hash, so it only 
needs to be long enough to resist an attacker's attempt to produce an 
otherwise-valid UEB with the same abbreviated hash, i.e. a second pre-image 
attack against an abbreviated, tagged SHA-256d hash.
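
Concretely, in the ideal-hash model (the 96-bit figure is a hypothetical
truncation length of my choosing, just to show the arithmetic):

```python
# If the abbreviated UEB hash only has to resist second pre-images, an
# h-bit truncation costs the attacker ~2**h work.  If birthday collisions
# mattered, the same h bits would buy only ~2**(h/2) of security.
h = 96                              # hypothetical abbreviated hash length
work_second_preimage_bits = h       # 96-bit work factor for the attacker
work_birthday_bits = h // 2         # only 48 bits if birthdays applied
```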

	Shawn.


