[tahoe-dev] blocks instead of files?

Brian Warner warner at lothar.com
Mon Mar 15 02:27:56 UTC 2010


Zooko O'Whielacronx wrote:
> 
> Maybe we shouldn't be trying to optimize too much for these sorts of
> questions. When Brian is describing the difference between Mountain
> View and Tahoe-LAFS, he often starts by saying that Tahoe-LAFS stores
> entire shares (all the blocks of the share) together in one place so
> that the uploader, downloader, and server have fewer separate objects
> to keep track of. Maybe that's the best reason to keep doing it the
> way we're doing it now.

A bit more background on this: the MV/Mnet/MojoNation approach was to
split files into fixed-size segments (I believe the largest segsize was
2MiB, or maybe 16MiB, but I'm probably wrong). Then each segment was
erasure-coded into blocks (3x was typical expansion), and each block was
uploaded completely independently to a couple servers (I think 4x was
typical replication).

For large files, this resulted in a large number of encoded blocks to
keep track of. A 200MB file could result in 100 segments, 300 blocks,
and 1200 copies of blocks.
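
To make that arithmetic concrete, here's a back-of-the-envelope sketch
(the segsize, expansion, and replication numbers are just the guesses
from above, not gospel):

    # rough object counts for the MV scheme described above; the default
    # parameters are the guessed values from the previous paragraphs
    def mv_object_counts(file_size, segsize=2*1024*1024,
                         expansion=3, replication=4):
        segments = -(-file_size // segsize)   # ceiling division
        blocks = segments * expansion         # erasure-coded blocks
        copies = blocks * replication         # replicated copies on servers
        return segments, blocks, copies

    # treating "200MB" as 200MiB gives exactly the numbers above
    print(mv_object_counts(200 * 1024 * 1024))   # -> (100, 300, 1200)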

The storage servers managed their blocks by aggregating same-sized
blocks into big slabfiles, and using a small database to map blockid to
(slabfile, offset). The number of rows in this database grew so large
that the database index could no longer be held in RAM, at which point a
basic DYHB ("Do You Have Block?") query required a disk seek, dropping
the maximum DYHB acceptance rate to about 100Hz. And because the seek
occurred in a blocking read() call, the server could do nothing else
(like serve data) while waiting for the result.
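
If you want a mental model of that lookup path, it was roughly
something like this (an illustrative sketch, not the actual Mnet code;
the table and column names are made up):

    import sqlite3

    # stand-in for the "small database" mapping blockid -> (slabfile, offset)
    db = sqlite3.connect("block_index.sqlite")
    db.execute("CREATE TABLE IF NOT EXISTS blocks"
               " (blockid BLOB PRIMARY KEY, slabfile TEXT, slab_offset INTEGER)")

    def do_you_have_block(blockid):
        # a DYHB query is just an index probe: once the index no longer
        # fits in RAM, this blocks on a disk seek, which is what capped
        # the server at roughly 100 queries per second
        row = db.execute("SELECT slabfile, slab_offset FROM blocks"
                         " WHERE blockid=?", (blockid,)).fetchone()
        return row is not None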

Worse yet, because an MV client needed to ask about 1200 objects for
each file download, the rate at which it was sending DYHB queries was
high too, making that 100Hz limit pretty troubling.

Tahoe's share-based approach (as opposed to MV's block-based approach)
means that each share results in exactly one object for a storage server
to manage. The default 3-of-10 encoding means each file creates 10
objects that need to be managed. Where a 200MB file could result in 1200
MV objects, in Tahoe it turns into 10 objects. And where MV would need
to issue 1200 DYHB queries, Tahoe only emits 10 (and could get away with
3 if it were feeling lucky).
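
Here's the same back-of-the-envelope comparison, share-style (again
just a sketch; the only real parameters are the default k=3, N=10):

    # with k-of-N encoding, the number of objects per file is just N,
    # independent of file size
    def tahoe_object_counts(k=3, n=10):
        shares = n          # objects the storage grid has to track
        dyhb_queries = n    # queries a cautious downloader sends
        min_queries = k     # queries if it's feeling lucky
        return shares, dyhb_queries, min_queries

    print(tahoe_object_counts())   # -> (10, 10, 3), vs MV's ~1200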

This roughly 100-fold reduction in the number of things that a storage
server must keep track of was the biggest reason we built Tahoe. The DB
lookup performance was killing us. In Tahoe, the number of items being
tracked (which tends to be 1 to 10 million for a full 1TB drive) is
small enough that we don't even use a DB. Instead, we just use the
filesystem (Tahoe's dead-simple storage/shares/$PREFIX/$SI/$SHNUM
scheme). So far, linux+ext3 seem to do a decent job of caching that
much structure in RAM (although I think it's been a looong time since
we did any server-performance testing, like how many get_bucket() calls
per second we can handle).
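
For reference, constructing that path is about as simple as storage
gets (a sketch; I'm assuming $PREFIX is just the first two characters
of the base32 storage index, and the names here aren't the real Tahoe
function names):

    import os

    # sketch of the storage/shares/$PREFIX/$SI/$SHNUM layout
    def share_path(storedir, si_b32, shnum):
        prefix = si_b32[:2]
        return os.path.join(storedir, "shares", prefix, si_b32, str(shnum))

    # a get_bucket()-style check is then just a directory listing,
    # leaning on the kernel's dentry/inode caches instead of a DB index
    def has_shares(storedir, si_b32):
        d = os.path.dirname(share_path(storedir, si_b32, 0))
        return os.path.isdir(d) and bool(os.listdir(d))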


cheers,
 -Brian


