[tahoe-dev] Random access to files

Brian Warner warner-tahoe at allmydata.com
Thu Jan 29 02:30:59 UTC 2009


> > Not impossible, but not trivial.
> 
> Dang.

As Zooko pointed out, I'm probably talking about a larger project than you
need. If you're willing to tolerate a larger amount of overhead per read,
then the truncated-list-of-segments approach might be good enough.

> Hmm.  What I want to do is binary search a sorted file.  Assuming
> caching in the node, this is a small win, but only a small one.

Yeah, that would benefit more from the larger project I described.

> A single segment?  Or a single bucket?  Is this what mutable.txt
> refers to when it says:
> 
>     "i.e. the write enabler is associated with the "bucket", rather
>      than the individual shares."

Nope, that's something different. More below..

> Let me see if I have the terminology straight here (maybe I should
> start a glossary.txt file!):
> 
> * slot:     A persistent reference to a mutable file
> 
> * segment:  A chunk of an encrypted file.  Segments are the units for
>             erasure coding.

Yes and yes.

> * share:    The output of erasure coding is a set of N shares, k of
>             which are needed to reconstruct the file.  A share is also
>             the unit of upload/download.

Yes, but there's the segmentation wrinkle. Look at docs/file-encoding2.svg.
The ciphertext is split into segments, then each segment is erasure-coded
into a set of N "blocks". The first share consists of the first block from each
segment (along with hash trees, etc). The second share consists of the second
block from each segment, etc.

Unfortunately, we aren't very consistent about our use of the word "block".
But the "share" is the sequence of opaque bytes that is handed to the storage
server (and eventually back to the client), and the "block" is a piece of the
share that's designated for unzfec'ing into a specific segment of ciphertext.
Once we've located the share and grabbed some hashes, we fetch one block at a
time.
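To make the interleaving concrete, here's a rough sketch in Python. This is
not the real upload code: erasure_encode() is a toy stand-in for zfec (it
does not do real Reed-Solomon coding), and the hash trees are omitted
entirely.

 # Toy sketch of the block/share interleaving described above.
 def erasure_encode(segment, k, n):
     # stand-in for zfec: split into k data pieces, then (unrealistically)
     # repeat pieces to reach n "blocks" -- NOT real erasure coding
     piecelen = (len(segment) + k - 1) // k
     pieces = [segment[i*piecelen:(i+1)*piecelen].ljust(piecelen, b"\x00")
               for i in range(k)]
     return pieces + [pieces[i % k] for i in range(n - k)]

 def assemble_shares(ciphertext_segments, k, n):
     shares = [[] for _ in range(n)]   # share i collects block i of each segment
     for segment in ciphertext_segments:
         for sharenum, block in enumerate(erasure_encode(segment, k, n)):
             shares[sharenum].append(block)
     # each share is the sequence of its blocks (hashes, UEB, etc. omitted)
     return [b"".join(blocks) for blocks in shares]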

> * bucket:   A set of shares, each from a different segment of a single
>             file, that is stored on a given server.  A bucket is
>             usually uploaded and downloaded as a unit, though I don't
>             believe that's necessary.

Nope. A "bucket" is a directory on the storage server, containing one or more
shares for the same file. Each bucket is labeled with a storage index, then
within the bucket, each share is labeled with a share number. When a client
wants to download a file, it sends the storage index to the server and gets
back a set of shares, all of which happen to be stored in the same bucket.
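In other words, the storage index names a bucket, and the share numbers name
the shares inside it. A toy model of the server-side lookup (the real on-disk
layout differs in its details, so treat these paths as made up):

 import os

 def get_shares(storedir, storage_index_b32):
     # map sharenum -> path for every share in this bucket
     bucket = os.path.join(storedir, "shares", storage_index_b32)
     if not os.path.isdir(bucket):
         return {}                      # no shares for this file here
     return dict((int(name), os.path.join(bucket, name))
                 for name in os.listdir(bucket))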

Much of the metadata associated with a share is really supposed to be
associated with a bucket: the lease timers and accounting information should
be the same for all shares on a given server, so that stuff would be more
appropriately attached to the bucket than to the individual shares (see #600
for more details). One of the bits of metadata on a mutable share is the
"write enabler", a shared secret which governs the authority to mutate the
share (write-cap holders can derive the write enabler; read-cap holders
cannot).

>     "i.e. the write enabler is associated with the "bucket", rather
>      than the individual shares."

The write-enabler is the same for each share. So, by the logic above, it
should be attached to the bucket, rather than the individual share. That's
what this comment in mutable.txt is referring to.
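To illustrate just the derivation property (the tag string and inputs below
are made up; see mutable.txt for the real scheme): the write enabler is
derived from a secret only write-cap holders have, plus something
server-specific, so it is per-server but identical for every share in that
server's bucket.

 import hashlib

 def tagged_hash(tag, *values):
     # made-up tagged hash, for illustration only
     h = hashlib.sha256(tag)
     for v in values:
         h.update(hashlib.sha256(v).digest())
     return h.digest()

 def write_enabler(writecap_secret, server_nodeid):
     # same for every share in this server's bucket; underivable from a read-cap
     return tagged_hash(b"write-enabler-v0:", writecap_secret, server_nodeid)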

> > I'm not sure how deeply you'd like to get into this.
> 
> Hmm.  Not that deeply, not now anyway.

I don't blame you :).


> > Reducing the number of segments that get fetched is basically enough
> > to reduce the amount of work done for a single random-access
> > read. You could reduce it further by fetching a subset of the hash
> > tree: to validate one segment, you only need to fetch log2(Nsegs)
> > hashes, but since the current code assumes it's going to download
> > the whole thing, it fetches all (Nsegs) of them ahead of time.
> 
> I don't think I see the issue here.  Each segment (rather, the shares
> of each segment) contains all the hash tree nodes needed to verify
> that segment back to the root, right?

Each share contains all the hash tree nodes needed to validate every block in
that share back up to the root, yes. But you don't need to retrieve all of
those nodes in order to validate a single segment of ciphertext. Our current
download algorithm retrieves all the hashes, and (particularly for large
files or long RTTs) that's a significant chunk of overhead. A more-clever
algorithm would only download the minimal required set.
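To put rough numbers on that, here's the arithmetic (no Tahoe code, just
counting block-hash-tree nodes for a single share):

 import math

 def block_hashes_fetched(nsegs, minimal=False):
     # nodes fetched to validate ONE block, i.e. one segment's worth
     leaves = 1 << max(0, math.ceil(math.log2(nsegs)))   # ceil_pow2(nsegs)
     if minimal:
         return int(math.log2(leaves))    # one Merkle chain: ~log2(nsegs) nodes
     return 2 * leaves - 1                # current code: the whole tree

 # e.g. 1024 segments: the minimal chain is 10 hashes, the whole tree is 2047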

Take a look at src/allmydata/immutable/layout.py, and also at
docs/file-encoding[3456].svg. Each share contains a data structure that
looks roughly like:

 blocks[NSEGS] # called "data" in layout.py
 crypttext_hashtree_nodes[ceil_pow2(NSEGS)*2-1]
 block_hashtree_nodes[ceil_pow2(NSEGS)*2-1]
 share_hashtree_chain[log2(ceil_pow2(NSHARES))]
 uri_extension:
  crypttext_hashtree_root
  share_hashtree_root

Ignore crypttext_hashtree_nodes and crypttext_hashtree_root (they form a
Merkle tree over the ciphertext segments and are used for extra verification
after decoding).

block_hashtree_nodes[] contain the complete Merkle hash tree for which
blocks[] are the leaves. If you take the root of each block hash tree (one
per share) and put them into a list, that list forms the leaves of the "share
hash tree". The minimum necessary Merkle hash chain from this particular
share up to the root is stored in "share_hashtree_chain". The root of the
"share hash tree" is stored inside the uri_extension as
"share_hashtree_root". The uri_extension is the same in every share, and its
hash (the "UEB hash") is placed in the CHK read-cap.
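Here's that chain of roots restated as (heavily simplified) code. The real
hashes are tagged, the UEB carries many more fields, and the padding rule for
odd-sized layers is different, so this is just to show the shape:

 import hashlib

 def h(*parts):
     return hashlib.sha256(b"".join(parts)).digest()

 def merkle_root(leaves):
     layer = list(leaves)
     while len(layer) > 1:
         if len(layer) % 2:
             layer.append(layer[-1])          # simplified padding
         layer = [h(layer[i], layer[i+1]) for i in range(0, len(layer), 2)]
     return layer[0]

 def ueb_hash_for_readcap(block_hashtree_roots):
     # one block-hash-tree root per share -> share hash tree -> UEB -> UEB hash
     share_hashtree_root = merkle_root(block_hashtree_roots)
     uri_extension = share_hashtree_root      # really a bundle of many fields
     return h(uri_extension)                  # this goes into the CHK read-cap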

So, if you want to obtain a single segment of ciphertext, you need to fetch
"k" validated blocks. To validate a single block, you need:

 the block data [blocknum=segnum, sharenum]
 the block hash tree Merkle chain for it [blocknum, sharenum]
 the block hash tree root [sharenum]
 the share hash tree Merkle chain for that share [sharenum]
 the share hash tree root (in the UEB)
 the UEB
 the UEB hash (in the read-cap)

The last three are the same for all shares, and we grab the UEB and the
share-hash-tree-root from the first server that we see. The
share-hash-tree-Merkle-chain and the block-hash-tree-root that we need
depend upon which share we're looking at, so we have separate classes to
manage each share and hold these values. (Note that these chains overlap, so
if we manage to get the chain for share#0 before we decide that we want
share#1, we can pull down fewer share-hash-tree nodes, but in practice this
rarely wins us much, since we want to grab shares in parallel.)

It's the remaining two items (the block data and its block-hash-tree Merkle
chain) that are per-(share+block). If you're grabbing block#0 (which is one
of the FEC components for seg#0) out of a file with 8
segments, you only need to fetch three nodes of the block-hash-tree (using
the picture in allmydata.hashtree.CompleteBinaryTreeMixin, you'd need hashes
number 8, 4, and 2). If you've already validated block#0 and want to grab
block#1, you don't need to fetch any new hashes at all (you already have 4
and 2, and you've computed hash 7 yourself, by hashing block#0). If you then
want to validate block#2, you only need hash number 10. Reading the file in
linear order, you end up fetching a total of NSEGS hashes.
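If you want that in code form, the "which hashes do I need" question is just
a walk from the leaf up to the root, collecting siblings you don't already
know. Using the CompleteBinaryTreeMixin numbering (root is 0, node i's
children are 2i+1 and 2i+2), this little sketch reproduces the numbers above:

 def needed_hashes(leafnum, num_leaves, known=frozenset()):
     # num_leaves must be a power of two (i.e. already ceil_pow2'ed)
     node = (num_leaves - 1) + leafnum        # the leaf's index in the flat tree
     needed = []
     while node != 0:
         sibling = node + 1 if node % 2 else node - 1
         if sibling not in known:
             needed.append(sibling)
         node = (node - 1) // 2               # climb to the parent
     return needed

 print(needed_hashes(0, 8))                               # -> [8, 4, 2]
 # next, block#1: we fetched 8,4,2 and computed 7 (from block#0) and 3 (from 7+8)
 print(needed_hashes(1, 8, known={7, 8, 3, 4, 2}))        # -> []
 # then block#2: by now we also know node 1 (computed from 3 and 4)
 print(needed_hashes(2, 8, known={7, 8, 3, 4, 2, 1}))     # -> [10]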


So, that's a long-winded way of saying that our current download code could
be smarter, and that the quick truncate-the-list-of-fetched-segments approach
will still be fetching 32*ceil_pow2(NSEGS)*2 bytes of block hash tree nodes
each time, even if you're only downloading a single segment. Divide this by
your bandwidth, add to it the computation time to validate the whole hash
tree, and that gets you the per-read overhead cost.
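For a sense of scale (I'm assuming 32-byte hashes and the usual 128KiB
segment size here, so adjust if your config differs): a 1GiB file comes out
to 8192 segments, which means half a megabyte of block-hash-tree nodes per
read with the naive approach.

 # back-of-the-envelope, with assumed 128KiB segments and 32-byte hashes
 filesize = 1 * 1024**3                  # 1 GiB
 nsegs = filesize // (128 * 1024)        # 8192 segments
 bht_bytes = 32 * nsegs * 2              # ceil_pow2(8192) == 8192
 print(bht_bytes // 1024)                # -> 512 (KiB of hashes per read)
 # vs. roughly 13 hashes (416 bytes) for a single minimal Merkle chain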


> You could go either way there.  The caching can either be in the
> client or in the server.  The truly naive clients will be those that
> don't even know they're using a DFS -- normal programs reading through
> a FUSE layer, for example.  But in that case, the FUSE layer is the
> real client and it knows to be smarter.

Yeah, we go back and forth on that. The way it usually works is, I suggest an
API that should be maximally convenient for some particular use case that I
have in mind, frequently involving caching behind the scenes or automatic
retry of failures, or something. Then Zooko points out that there are plenty
of other use cases, and that an API which is maximally predictable and easy
to explain (and understand) is best, even if it appears less convenient (like
when it exposes more failure modes to the caller). I usually go ahead and
implement my API anyways, and then a few months later realize that I was
wrong and quietly replace it with whatever Zooko suggested in the first
place. It's an iterative process :-).


cheers,
 -Brian


