[tahoe-dev] [p2p-hackers] convergent encryption reconsidered

Brian Warner warner at lothar.com
Sat Mar 22 07:08:00 UTC 2008


[apologies if this gets double-posted, I've been having email problems
today]

On Thu, 20 Mar 2008 21:56:58 -0700
Jim McCoy <mccoy at mad-scientist.com> wrote:

>  >   confirmation-of-a-file attack -- if the attacker already knows
>  >   the full plaintext of a file, then they can check whether a
>  >   given user has a copy of that file.  
> 
> The truth of this depends on implementation details, and is an  
> assertion that cannot be said to cover all or even most of the  
> potential use-cases for this technique.  

For concreteness (and for the benefit of the readers who aren't familiar with
Tahoe), I'd like to throw in a few notes about how Tahoe does things.

For each immutable file that gets uploaded, the first step is to decide upon
an AES encryption key. One option is to simply pick a random 128-bit number.
Another is to hash the file. Tahoe 0.9.0 uses a tagged SHA-256d hash of the
file. (Every different use of a hash function in Tahoe gets a distinct tag;
in this case, the tag is the netstring of
"allmydata_immutable_content_to_key_v1+", plus a string that encapsulates (k,
N, segment_size), to make sure that alternate encodings of the same file
result in distinct versions. See [1] for more details.)
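The scheme above can be sketched in a few lines of Python. This is a
hypothetical illustration, not the real allmydata.util.hashutil code: the
helper names (netstring, sha256d, convergent_key) and the exact way the
encoding parameters are folded into the tag are my assumptions.

```python
import hashlib

def netstring(s: bytes) -> bytes:
    # netstring encoding: b"<decimal length>:<bytes>,"
    return b"%d:%s," % (len(s), s)

def sha256d(data: bytes) -> bytes:
    # SHA-256d: SHA-256 applied twice, which also blocks
    # length-extension games
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def convergent_key(k: int, n: int, segsize: int, plaintext: bytes) -> bytes:
    # The tag is the netstring of a use-specific label plus the encoding
    # parameters, so the same file with different (k, N, segment_size)
    # yields a different key.
    tag = (netstring(b"allmydata_immutable_content_to_key_v1+")
           + b"%d,%d,%d," % (k, n, segsize))
    return sha256d(tag + plaintext)[:16]  # truncate to a 128-bit AES key
```

The point of the tag is domain separation: two different uses of the hash
function can never collide, even on identical input data.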

Regardless of how the encryption key is generated, the key is hashed to
create the "Storage Index", or "SI". This storage index has two purposes: to
determine the set of servers on which the shares will be placed, and to index
the shares on those servers. When using CHK, the SI is a second-order hash of
the plaintext. When not using CHK, the SI is a hash of a random string.
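A minimal sketch of that derivation, with an invented tag (the real tag
string and truncation length live in hashutil.py and may differ):

```python
import hashlib

def storage_index(aes_key: bytes) -> bytes:
    # Illustrative only: hash the encryption key down to a fixed-size
    # index. With CHK the key is itself a hash of the plaintext, so the
    # SI becomes a second-order hash of the plaintext; with a random key,
    # the SI reveals nothing about the contents.
    tag = b"%d:%s," % (len(b"si_v1"), b"si_v1")  # hypothetical tag
    return hashlib.sha256(tag + aes_key).digest()[:16]
```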

The storage servers are basically just remote tables that map Storage Index
to a chunk of data. Any client with storage authority (i.e. an account of
some sort) can add new entries to this table by uploading a share. Any client
can request the share associated with a given index.
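As a toy model (method names are mine, not the real storage-server remote
interface), a server is little more than a nested dictionary:

```python
class StorageServer:
    """Toy model: a table mapping storage index -> {share number -> data}."""

    def __init__(self):
        self.shares = {}

    def upload(self, si: bytes, shnum: int, data: bytes) -> None:
        # any client with storage authority may add an entry
        self.shares.setdefault(si, {})[shnum] = data

    def has(self, si: bytes) -> bool:
        # answering this query is exactly where the SI leaks to the server
        return si in self.shares

    def download(self, si: bytes, shnum: int) -> bytes:
        # any client may fetch a share for a known index
        return self.shares[si][shnum]
```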

So when convergence is in use, any attacker who gets to see the storage index
is effectively seeing a hash of the file's plaintext: not an exact hash, but
a value derived from the plaintext with all the same properties as a normal
cryptographic hash function.

Who gets to see a storage index and correlate it to a given user? The SI is a
parameter in all share upload and download requests, and it is present inside
the "verifier-cap" and "repair-cap" values (which can be transferred to
delegate verification or repair duties). If the user's client machine
performs any of these requests directly, the machine at the other end will
see the correlation.

The links between the client and the storage servers are encrypted (since
those are Foolscap connections), so an observer watching those links will not
learn the SI values. However, the storage servers themselves can see
everything.
Clients which are uploading files will first ask at least N servers whether
they already have the share in question: those servers will learn that the
client holds the given file and wants to publish it. Clients who are trying
to download the file will ask at least k servers if they have the share:
those servers will learn that the client desires the given file. So an
adversary who manages to provide or compromise more than a handful of storage
servers will be able to observe the storage index and client identity for
every read or write operation performed in the grid. Running just one server
is enough to see a significant portion of the requests.

Our conservative design treats the storage servers as part of the threat. We
expect these servers to return data that was previously uploaded, nothing
more: the security properties of the system must hold even if the servers
corrupt or publish the shares. We assume that everything on the storage
servers is public knowledge. (Note that our *availability* properties depend
upon enough of the servers returning their shares uncorrupted, but our
*security* properties do not).

Granted, in a commercially-run storage grid, clients will only be using
servers that the company provides, and we do not write storage servers that
corrupt or publish the shares they are storing. But in most other use cases,
we expect that it will be fairly easy for a potential eavesdropper to bribe
users into revealing the storage index values that they are interested in:
just offer them storage space :-).

Also note that we record the exact file size inside each share, so that the
ciphertext file can be reconstructed and verified (and repaired) using just
the information from a single share. This reduces the guessing space
considerably. For large files, the file size may be fairly unique, so it may
be possible to perform correlation even without a hash of the contents.

Finally, for accounting/quota purposes the storage servers record an account
number in each lease on each share. This number only needs to be correlated
between servers by some system-wide accounting server, so that it can
determine that Alice is using 5GB of space because she has 1GB of shares on
server 1, 2GB on server 2, and 2GB on server 3 (and bill her accordingly).
The account number could be hashed with the serverid and made different on
each client, but this would increase the amount of work that the accounting
server must do (or store), and would make certain delegations more difficult
(like the Upload Helper).
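The hashed-account-number variant mentioned above might look like the
following sketch (function name and truncation are my invention):

```python
import hashlib

def per_server_account_id(account_number: bytes, serverid: bytes) -> bytes:
    # Hypothetical blinding: each server sees a different opaque id for
    # the same account, so servers cannot correlate one user's leases
    # across servers without help from the accounting server.
    return hashlib.sha256(account_number + serverid).digest()[:16]
```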

> This property only holds if it is possible for the attacker to link a
> selected ciphertext/file to a user. Systems which use convergent encryption
> to populate a shared storage pool _might_ have this property, but it is by
> no means a certainty; if a system is implemented correctly it is not
> necessary for users to expose their list of files in order to maintain this
> shared storage space.  

At the PET Workshop last year, I spent some time brainstorming about what we
would do if we were trying to offer anonymity (which we aren't). I believe
that turning off convergence is a first requirement. Hashing the SI with the
serverid to create a per-server index would make it more difficult for a
server to correlate one of their shares against shares on a different
machine. Adding padding or rounding up the filesize to some convenient
boundary (Zooko suggested the segment size, probably 128KiB for us) would
make filesize less of a distinguisher.
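Both ideas are simple to express; this sketch assumes the suggested 128KiB
segment boundary and an invented per-server hash construction:

```python
import hashlib

SEGMENT_SIZE = 128 * 1024  # suggested rounding boundary

def per_server_index(si: bytes, serverid: bytes) -> bytes:
    # Hashing the SI with the serverid gives each server a different
    # index for the same share, hindering cross-server correlation.
    return hashlib.sha256(si + serverid).digest()[:16]

def padded_size(filesize: int) -> int:
    # Round the file size up to the next segment boundary, so exact
    # sizes are no longer a distinguisher.
    return ((filesize + SEGMENT_SIZE - 1) // SEGMENT_SIZE) * SEGMENT_SIZE
```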

But these have their costs. Per-server index values would make repair more
difficult: servers wouldn't have enough information to perform repair on
their own, so the original client would need to be involved in the repair
process. Padding would consume more storage space. For Tahoe, we value
reliability and simplicity over anonymity, so we haven't pursued these
approaches.

> To be a bit more specific, this is really just a version of a standard
> dictionary attack. The solution to this problem is to look at similar
> systems that suffered from dictionary attacks and see what solutions were
> created to solve the problem.  

Yeah, exactly. Most of the solutions I'm aware of are to add a random salt,
and thus break convergence.

> The problems with this imagined attack are twofold...  

Unfortunately the way we're using Storage Index values in Tahoe exposes a lot
more information to the attacker than the mnet design did (and in this case
we're imagining the storage servers to be the attacker). Rather than storing
FEC-output blocks according to their hash, Tahoe is storing shares according
to their SI. So the correlation job is much easier: a pair of hashes at
worst.

> Assuming an all-seeing oracle who can watch every bit sent into the  
> storage pool will get us around this first problem, but it does raise  
> the bar for potential attackers.  

True. In Tahoe, someone who isn't running a storage server (or a Helper or a
Repairer) won't be able to see the storage index, so they'll have a lot less
information to work with. But in a distributed system, you don't want your
security properties to depend upon this.

>  >  Defense Against Both Attacks
>  >
>  >   [...]  
> 
> Bad idea, for a couple of reasons.  This first problem is that you are  
> assuming the added secret is adding a significant amount of entropy  
> and the second is that you are throwing out the advantage of  
> convergent encryption.  

Towards the first problem, the secret that we hash in is required to be a
random string at least as long as the AES key. So by definition it is adding
enough entropy to bring the unguessability back up to the keysize limit.

Towards the second problem, yeah, that's the tradeoff :). Per-client or
per-account convergence retains some of the benefits without exposing the
low-entropy files to attack by folks who don't know your mixin secret.
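A sketch of the mixin (the exact construction Tahoe would use is not pinned
down here; the netstring-style prefix and truncation are my assumptions):

```python
import hashlib

def per_client_convergent_key(secret: bytes, plaintext: bytes) -> bytes:
    # Mixing a machine-random secret (at least as long as the AES key)
    # into the hash restores full-keyspace unguessability against anyone
    # who lacks the secret, while clients that share the secret still
    # converge with each other on identical files.
    assert len(secret) >= 16
    tag = b"%d:%s," % (len(secret), secret)  # netstring-style mixin
    return hashlib.sha256(tag + plaintext).digest()[:16]
```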

> If the secret is shared across multiple files then I can run a
> known-plaintext attack on your system using a file that I assume I have in
> common with my target (e.g. a standard, small, OS file) to get their
> added_secret and then once I know my target's secret I move on to the file
> I really want to go after.  

I think this is only true if the secret is a user-generated password. Since
we plan to use a long machine-generated random string, I believe the
known-plaintext attack becomes as hard as guessing that string, and we can
make that as hard as we want. This secret would be generated by the client
the first time it runs, and kept in a private file. To share it between
accounts, some out-of-band mechanism is needed to make sure all clients get
the same value, but by no means will it be created by a human's low-entropy
brain :-).
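First-run generation of such a secret is a one-liner around os.urandom; the
file path and layout below are illustrative, not what Tahoe actually uses:

```python
import os

def get_convergence_secret(path="private/convergence") -> bytes:
    # Generate a 128-bit machine-random secret on first run and keep it
    # in a private file; return the same value on every later run.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    secret = os.urandom(16)
    d = os.path.dirname(path)
    if d:
        os.makedirs(d, exist_ok=True)
    with open(path, "wb") as f:
        f.write(secret)
    os.chmod(path, 0o600)  # keep it readable by the owner only
    return secret
```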

> A better solution would be to look at something like bcrypt() (see "A  
> Future-Adaptable Password Scheme") and use this mechanism for files  
> below a certain threshold.  

Interesting... I'll read up on that.

> 	-Personal documents
> 	-Photos
> 	-"Home" movies
> 
> The size of the first is negligible.  The second is large and getting  
> larger, and the third is probably a lot smaller than most of us  
> think.  Photos are really the only one that will skew this result.  

True, although I'm more concerned about the small files these days since I
started thinking about the per-file storage overhead (Tahoe stores a lot of
hashes along with the share data, and ext3 minimum block sizes are 4KiB on
large disks). I hope to include this overhead information in the analysis
that we're doing with the allmydata userset to get a feel for how negligible
the small files really are.

There are a lot of factors that I can imagine influencing the value of
convergence. Home movies are likely to be less-replicated. Downloaded media
is likely to be more-replicated between groups of friends (who influence each
other's tastes) than between unrelated people: if use of Tahoe spreads from
one friend to another, we might see more replication than if it were promoted
by more traditional broadcast advertisement. There may be smaller files that
are so heavily replicated that the overhead on them would make them worth
converging.

The problem that concerned us is that the guessability/entropy/sensitivity of
the file is not entirely correlated with size. Large files are probably
media, media is probably not sensitive and probably shared. But we can
imagine files which are large, sensitive, and low-entropy, all at the same
time. I'd be a bit nervous about deploying a heuristic that retained a chance
of exposing certain user files to guessing attacks.

Once we've done some analysis and made an estimate on the benefits of
convergence, we'll be in a better position to make a decision about the
tradeoffs. I suspect that we'll wind up with a threshold that says we use
cross-client convergence for any file that's bigger than 2MB, and only
per-client convergence below that, and then try to make it clear to users
what the concern is so they can click the "make it more secure" button for
their large+guessable+sensitive files. But I want to know what the space
savings is (and who really benefits from it) before I make that decision.

thanks,
 -Brian

[1]: http://allmydata.org/trac/tahoe/browser/src/allmydata/util/hashutil.py


