[tahoe-dev] proposal for an HTTP-based storage protocol

Brian Warner warner at lothar.com
Wed Sep 29 01:44:53 UTC 2010


>>> Do we actually need server-side verification of data? We already let
>>> clients upload whatever they want to servers, as long as it's
>>> properly formatted as a share.
>>
>> Yes, but they can't upload something that looks like a share of file
>> A but actually contains some other content (unless they find a hash
>> collision).
>
> Ah! So if I'm understanding correctly: a client could try to overwrite
> a share for an existing file, and the server currently prevents this
> by verifying that the share is for the correct file? And if the server
> doesn't verify the share, there's the possibility of a DoS if a
> malicious client overwrites all shares for a given file?

Yeah, the server's job is to only accept certain writes. If any client
could overwrite any share, then one client could destroy other clients'
data. (A smaller attack: if any client could create a new share at an
arbitrary storage-index, they could create dummy shares in exactly the
same places where the victim was about to upload their own shares,
preventing those good shares from being written and effectively DoSing
the overall upload. We call this the "roadblock" attack.)

The simplest way to do this, and what we do in Tahoe's current
storage-server protocol, is to allow any writes of new immutable shares,
but not allow overwrites of existing immutable shares. For mutable
shares, each one is created with a shared-secret "write-enabler", which
must be presented with all subsequent overwrite requests.
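The policy above can be sketched as follows. This is a minimal illustration with hypothetical names (`StorageServer`, `write_immutable`, and friends are invented for this sketch, not Tahoe's actual API): immutable shares are create-once, and mutable slots record a write-enabler at creation time that every later overwrite must present.

```python
import hmac

# Minimal sketch of the server's write policy (names are illustrative,
# not Tahoe's real storage-server interface).
class StorageServer:
    def __init__(self):
        self.immutable = {}   # storage_index -> share bytes
        self.mutable = {}     # storage_index -> (write_enabler, share)

    def write_immutable(self, storage_index, share):
        # New immutable shares are accepted; overwrites are refused.
        if storage_index in self.immutable:
            raise PermissionError("immutable share already exists")
        self.immutable[storage_index] = share

    def create_mutable(self, storage_index, write_enabler, share):
        # The shared-secret write-enabler is fixed at creation time.
        if storage_index in self.mutable:
            raise PermissionError("mutable slot already exists")
        self.mutable[storage_index] = (write_enabler, share)

    def update_mutable(self, storage_index, write_enabler, share):
        # Overwrites must present the write-enabler set at creation.
        stored_we, _ = self.mutable[storage_index]
        if not hmac.compare_digest(stored_we, write_enabler):
            raise PermissionError("bad write-enabler")
        self.mutable[storage_index] = (stored_we, share)
```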

> Hrmm. I was hoping that the server could be treated as a dumb file
> store as much as possible, to simplify things.

Yeah, that's a great goal, and it's what drove the current protocol. The
current server knows whether a given share is mutable or immutable, but
nothing further. This decoupling is why three-year-old servers from
v1.0.0 still work just fine with v1.8.0 clients. It's also why the
storage-servers use hardly any CPU. Dumb servers are awesome.

I like this goal and would really like to keep it. On the other hand,
there are some vaguely-compelling arguments for making the server more
aware of (and thus more tightly coupled to) the share format:

 1: invalid data would be prohibited early, before it consumed any
    storage space, and before downloading clients tried to look at it,
    which might improve overall reliability or performance
 2: servers could perform file-verification locally, looking for
    bitflips in their local shares, possibly triggering repair
    independently of the client who cares about that data
 3: Accounting could be implemented without requiring confidentiality in
    the share-transfer protocol
 4: mutable-file updates could be implemented without a write-enabler,
    thus removing the confidentiality requirement from the protocol

The last two points need some explanation. With the current
Foolscap-based share-upload protocol (which keeps all messages secret),
Accounting could be implemented simply by presenting a per-server secret
with the write request. The server would check this secret against a
table to see whether the write was acceptable or not (according to the
accounting rules: the secret tells you that the share is from Bob, and
Bob is allowed to store 5GB on this server). Similarly, mutable-share
updates are presented with a shared secret. Both require a
confidentiality-preserving channel, otherwise an eavesdropper can steal
the secret and use it for their own shares.

The accounting scheme we have in mind involves ECDSA keypairs and signed
requests: the fact that the request is signed by privkey123 means that
it's from Bob. The signature must cover the entire message, otherwise an
attacker could take Bob's message and chop it up, replacing pieces of it
with her own data, thus effectively stealing Bob's storage space. But
the message is big (the whole share, potentially several GB), and
sending the signature at the very end would be a drag (should the server
buffer the whole share in the hopes that it will see a valid signature
at the end?). If the server has more information about the share format,
we'll get more flexibility in breaking up the share-upload messages and
still validating each piece.
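One way to break up the message is to sign each chunk (bound to its storage-index and offset) so the server can validate incrementally instead of buffering gigabytes waiting for one trailing signature. In this sketch, HMAC stands in for the ECDSA signature purely to keep it dependency-free; the chunking scheme and names are illustrative assumptions, not a proposed wire format.

```python
import hashlib
import hmac

KEY = b"bob-private-key"   # stand-in for Bob's signing key
CHUNK = 4                  # unrealistically tiny chunks, for illustration

def sign_chunk(storage_index, offset, chunk):
    # Bind the signature to the slot, the offset, and the chunk contents,
    # so an attacker can't splice her own data into Bob's upload.
    msg = storage_index + offset.to_bytes(8, "big") + hashlib.sha256(chunk).digest()
    return hmac.new(KEY, msg, hashlib.sha256).digest()

def upload(storage_index, share):
    # Client side: emit (offset, chunk, signature) triples.
    for off in range(0, len(share), CHUNK):
        chunk = share[off:off + CHUNK]
        yield off, chunk, sign_chunk(storage_index, off, chunk)

def receive(storage_index, triples):
    # Server side: verify and accept each chunk as it arrives, without
    # buffering the whole share first.
    out = bytearray()
    for off, chunk, sig in triples:
        if not hmac.compare_digest(sig, sign_chunk(storage_index, off, chunk)):
            raise ValueError("bad chunk signature at offset %d" % off)
        out[off:off + len(chunk)] = chunk
    return bytes(out)
```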

The mutable-share update protocol's write-enabler could be replaced with
a keypair: the share is created with a pubkey, and all requests to change
it must be signed by the corresponding privkey. But the original message
that established the write-enabling-pubkey could be subverted, replacing
it with a different pubkey. We'd like the server to have a way to know
which pubkey is the "right" one that is stronger than simply being told
so by the first caller. So we'd kind of like the fundamental identity of
the share (the storage-index) to be derived from that pubkey. And we'd
like that pubkey to be the same as the one that's used to sign the
actual contents of the file (i.e. the one derived from the readcap, for
which the privkey is derived from the writecap), so that there's no
possibility of having a share be stored in the wrong slot (what happens
when an attacker presents you with valid share XYZ in a message that
says to please write it into slot ABC? how can you detect the mismatch?).
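Deriving the storage-index from the pubkey makes that mismatch detectable, since the slot name itself commits to the key. A sketch, with an invented derivation (the tag string and truncation here are assumptions, not Tahoe's actual format):

```python
import hashlib

def storage_index_for(pubkey: bytes) -> bytes:
    # Illustrative derivation: the storage-index is a hash of the
    # share's pubkey, so the slot name commits to the key.
    return hashlib.sha256(b"storage-index:" + pubkey).digest()[:16]

def check_slot(requested_slot: bytes, pubkey_in_share: bytes) -> bool:
    # Valid share XYZ aimed at the wrong slot ABC fails here: the slot
    # the attacker names cannot match the digest of the share's pubkey.
    return requested_slot == storage_index_for(pubkey_in_share)
```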

This all leads to a desire for a system in which the storage-index for
any given share captures the root of a collection of hashes and
signatures, such that it is hard for an attacker to submit a "roadblock"
share for a given storage-index that would prevent the "real" share from
being accepted later. In this scheme, each write (either creating a new
share, or modifying an existing mutable share) just contains the full
share that you want to see on the server's disk. The server then looks
carefully at it and validates all the same hashes and signatures that
the Verifier or the eventual downloader would do, and if they all match
(and the computed storage-index matches the slot being written to), the
share is saved to disk.
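The write path described above reduces to: validate everything a downloader or Verifier would check, confirm the computed storage-index matches the slot being written, and only then persist. A sketch under stated assumptions: `verify_share` is a placeholder for the full hash-and-signature walk, and the storage-index derivation is an invented stand-in.

```python
import hashlib

def handle_write(disk, slot, share, verify_share):
    # Validate the share's internal hashes and signatures (placeholder:
    # the real check walks the same structures the Verifier would).
    if not verify_share(share):
        raise ValueError("share fails internal hash/signature checks")
    # Confirm the share actually belongs in the slot being written
    # (stand-in derivation; a real one would follow the share format).
    computed = hashlib.sha256(share).digest()[:16]
    if computed != slot:
        raise ValueError("share does not belong in this slot")
    disk[slot] = share
```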


Anyways, the big issue is that we've got all this encryption and
erasure-coding in the way. If we were merely doing something immutable
like Git, storing each object under the hash of its contents, then the
upload+verification process would be really simple and robust (tell me
the hash of your contents, then give me the contents, and I'll store
them if and only if the hash matches). But since we've got some extra
levels in
the way, the server would need to know more about the format to be able
to verify everything. And if it can't verify everything, then there are
more opportunities for an attacker to inject something bad.
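For contrast, the Git-style baseline really is this small. A sketch of a content-addressed store that accepts an upload only if the claimed hash matches the contents (class and method names invented for illustration):

```python
import hashlib

class ContentAddressedStore:
    # Git-style immutable store: each object lives under the hash of
    # its own contents, so verification is a single hash comparison.
    def __init__(self):
        self.objects = {}

    def put(self, claimed_hash, contents):
        actual = hashlib.sha256(contents).hexdigest()
        if actual != claimed_hash:
            raise ValueError("contents do not match claimed hash")
        self.objects[claimed_hash] = contents
        return claimed_hash
```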

cheers,
 -Brian
