[tahoe-dev] Question about convergence keys

Brian Warner warner-tahoe at allmydata.com
Wed Aug 13 00:59:22 UTC 2008

> He mentioned Tahoe to me, which I hadn't come across before. And, as it
> happens, it solves a large number of the problems we'd like to solve.


> I have a question about convergence keys. My understanding is that the
> files are encrypted with their own hash, which means that two copies of the
> file will encrypt to the same thing, but unless you actually have the file
> itself you can't see the content.

That's correct. In Tahoe, if Alice and Bob each have a copy of the same file
foo.txt, and if they are each using the same convergence secret, then they
will both wind up using the same encryption key for the file, and they will
upload the same encrypted data. In Tahoe terms, they will both get the same
"read-cap": a relatively short string that is both necessary and sufficient
to recover the plaintext, which is used as a secure reference to the file.

> Traditionally, a hash gives no information about the content of a file, so
> posting the hash of a confidential file doesn't tell anyone anything they
> didn't already know.

Almost. To be precise, it gives you a small amount of information, at most
the number of bits in the hash (i.e. a cryptographic hash is not a
zero-knowledge proof). This information can be used to tell whether the file
being hashed is the same as a file that you're holding locally. If the file
has little entropy (i.e. it is small or mostly predictable), then an attacker
can enumerate the plausible contents, hash each one, and compare the result
against the published hash, which might make it possible to guess what is
inside the file.
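
To make the guessing attack concrete, here is a minimal Python 3 sketch. It
assumes a made-up file that is entirely predictable except for one four-digit
field; the template, the PIN value, and the helper name guess_file are all
hypothetical, and plain SHA-256 stands in for whatever hash was published.

```python
import hashlib

def guess_file(published_hash, candidates):
    """Return the first candidate whose SHA-256 matches published_hash, else None."""
    for candidate in candidates:
        if hashlib.sha256(candidate).digest() == published_hash:
            return candidate
    return None

# Hypothetical low-entropy file: everything but a 4-digit PIN is predictable.
template = b"Dear customer, your PIN is %04d.\n"
secret_file = template % 4821
published_hash = hashlib.sha256(secret_file).digest()

# The attacker only has to try 10,000 possibilities.
candidates = (template % pin for pin in range(10000))
recovered = guess_file(published_hash, candidates)
```

The attacker never inverts the hash; they only need the set of plausible
files to be small enough to enumerate.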

We call this the "partial-information guessing attack", and it was this
concern that prompted us to add the convergence secret. Mixing a
suitably-random convergence secret into the hash serves to remove the
remaining information from the resulting hash. People who know the
convergence secret will know 256 bits about your file; people who don't know
the secret will know zero bits about your file.
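
A minimal Python 3 sketch of why the secret helps, assuming a hypothetical
low-entropy file (a fixed template with a four-digit PIN). The keyed_hash
function is a simplified stand-in for H(convergence_secret+file), not Tahoe's
actual derivation, and all names here are made up for illustration.

```python
import hashlib
import os

def keyed_hash(convergence_secret, data):
    # Simplified stand-in for H(convergence_secret + file); not Tahoe's real function.
    return hashlib.sha256(convergence_secret + data).digest()

template = b"Dear customer, your PIN is %04d.\n"
secret_file = template % 4821
convergence_secret = os.urandom(32)  # 256 random bits, known only inside the domain

published = keyed_hash(convergence_secret, secret_file)

def search(secret_guess):
    """Try every PIN using the given secret; return the matching file or None."""
    for pin in range(10000):
        candidate = template % pin
        if keyed_hash(secret_guess, candidate) == published:
            return candidate
    return None

insider_result = search(convergence_secret)  # knows the secret: finds the file
outsider_result = search(b"")                # guesses the null secret: finds nothing
```

Someone holding the secret can still run the 10,000-guess search, while
someone without it would first have to guess 256 random bits.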

> Tahoe allows you to set a convergence key to add to the hash, but if you
> have a group of relatively trustworthy peers (the friendnet scenario), then
> you want everyone to have the same convergence key, and the null key is the
> easiest to agree on.

However, the null key is pretty guessable, so you're effectively allowing the
whole world to participate in your "convergence domain".

As zooko described elsewhere, the convergence domain is the set of people
with whom you share two properties:

 1: your uploads will converge with theirs, allowing you to save backend
    storage space and bandwidth when uploading identical files
 2: the other people will be able to mount a partial-information guessing
    attack against your files: the public information about your uploaded
    file (like the storage index)[1] will reduce the work they need to do

> Am I misunderstanding something? Is the default convergence key something
> other than a plain hash of the file? It would seem pretty easy to compute
> H(file) to get the hash, and H(file+some_zero_padding) to generate a
> convergence key.

The encryption key is effectively H(convergence_secret+file) [2]. The
convergence_secret is a random string that the Tahoe client creates when it
starts for the first time (or which you can override by writing whatever
base32-encoded string you want into BASEDIR/private/convergence). If you want
to share a convergence domain with your friends, just make sure you are all
using the same BASEDIR/private/convergence.

Also note that convergence is not necessarily as big a win as you might hope.
If both Alice and Bob have a bunch of identical files on their disks and are
uploading them, then yeah, it helps, but in some quick tests on allmydata
customer data we found the space savings to be less than 1%. You might want
to do some tests first (hash all your files, have your friends do the same,
measure the overlap) before worrying about sharing convergence secrets.
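
One way to run that test, as a rough Python 3 sketch: each person hashes
their own tree, you swap the results, and you count how many bytes overlap.
The function names file_hashes and shared_bytes are made up for this example,
and plain SHA-256 of the file contents is assumed to be good enough for
spotting identical files.

```python
import hashlib
import os

def file_hashes(root):
    """Map SHA-256 hex digest -> file size for every regular file under root."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files don't have to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            hashes[h.hexdigest()] = os.path.getsize(path)
    return hashes

def shared_bytes(mine, theirs):
    """Total size of my files whose hashes also appear in theirs."""
    return sum(size for digest, size in mine.items() if digest in theirs)
```

Dividing shared_bytes(mine, theirs) by sum(mine.values()) gives the fraction
of your data that convergence could save. The swapped hash lists carry the
same guessing-attack exposure described above, so only exchange them with
people you would share a convergence secret with anyway.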

hope that helps,

[1]: the "storage index", for immutable files, is a hash of the encryption
     key, and is used to figure out which servers should hold the shares, as
     well as used as an index on those servers to reference the shares. As a
     result, the storage index for any file is essentially public
     information, since every storage server will get to see it.

[2]: the actual specification is in allmydata/util/hashutil.py:132, in the
     convergence_hash() function, and is:

      t = "allmydata_immutable_content_to_key_with_added_secret_v1+"
      t += netstring(convergence_secret)
      t += netstring("%d,%d,%d" % (k,N,segsize))
      return SHA256d(netstring(t) + file)

     We use netstrings and SHA256d (instead of plain SHA256) to avoid "chosen
     protocol attacks", which would allow two different files to wind up with
     the same hash.
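
For experimenting, the four lines above can be fleshed out into a
self-contained Python 3 sketch. The netstring and sha256d helpers below are
written from their usual definitions ("<length>:<bytes>," and SHA-256 applied
twice) rather than copied from hashutil.py, and the argument order is this
sketch's own choice. It takes the file as one bytes object and returns the
full 32-byte digest; whether the real code streams the file or truncates the
result is not covered here.

```python
import hashlib

def netstring(s):
    """Encode bytes as a netstring: b"<length>:<bytes>,"."""
    return b"%d:%s," % (len(s), s)

def sha256d(data):
    """SHA-256 applied twice: SHA256(SHA256(data))."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def convergence_hash(k, n, segsize, data, convergence_secret):
    # Mirrors the four lines quoted above; argument order is this sketch's choice.
    t = b"allmydata_immutable_content_to_key_with_added_secret_v1+"
    t += netstring(convergence_secret)
    t += netstring(b"%d,%d,%d" % (k, n, segsize))
    return sha256d(netstring(t) + data)

# Example values only: the encoding parameters and secrets are placeholders.
alice_key = convergence_hash(3, 10, 131072, b"contents of foo.txt", b"shared secret")
bob_key = convergence_hash(3, 10, 131072, b"contents of foo.txt", b"shared secret")
carol_key = convergence_hash(3, 10, 131072, b"contents of foo.txt", b"other secret")
```

Alice and Bob share a secret and get the same key; Carol uses a different
secret and gets an unrelated one, so her upload will not converge with
theirs.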
