[tahoe-dev] Measuring the benefits of convergence

Brian Warner warner-tahoe at allmydata.com
Fri Mar 21 03:34:51 UTC 2008

Rob Kinninmont has been working to analyze some data that we have on the
existing allmydata userbase. While we were going over the tools that he's
building to do this, we realized that it is important to take constant
factors into account.

Specifically, there is a non-zero amount of overhead for bookkeeping,
integrity checking, versioning, and data formatting. In the current version
of Tahoe, each share has roughly 781 bytes of overhead for this purpose.

Disk block size is even more important. Tahoe currently uses a very simple
storage format that puts each share into a separate file. (Eventually we may
move to a more efficient one, but that is likely to come at the cost of
reliability and ease of maintenance.) The large disks that we use as storage
servers usually have large block sizes (to reduce the size of the
block-allocation bitmap). ext3 will use 4KiB block sizes once the overall
filesystem size goes above a hundred GB or so. As a result, even the smallest
of shares will use at least 4KiB of disk space. To improve this, we must
either switch to a filesystem that packs small files together (like reiser4,
I believe), or do packing ourselves. The way we store shares in Tahoe also
includes a separate directory for each storage index, which hits us with
another disk block. These overheads are multiplied by the total number of
shares we store.
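
As a rough illustration of the arithmetic above (a sketch, not code from the
Tahoe source), the on-disk cost of one share can be estimated by rounding the
share file up to whole blocks and adding one block for its storage-index
directory. The function names are hypothetical, and the model assumes one
directory block per share and ignores inodes and other filesystem metadata.

```python
import math

SHARE_OVERHEAD_BYTES = 781   # rough per-share overhead quoted above
BLOCK_SIZE = 4096            # 4KiB blocks, as ext3 uses on large filesystems


def blocks_needed(nbytes, block_size=BLOCK_SIZE):
    """Number of whole disk blocks occupied by a file of nbytes (at least one)."""
    return max(1, math.ceil(nbytes / block_size))


def share_disk_usage(share_data_bytes,
                     overhead=SHARE_OVERHEAD_BYTES,
                     block_size=BLOCK_SIZE):
    """Estimated bytes of disk consumed by one share.

    Assumption: one extra block is charged for the per-storage-index
    directory; inode and other filesystem metadata are ignored.
    """
    file_blocks = blocks_needed(share_data_bytes + overhead, block_size)
    directory_blocks = 1
    return (file_blocks + directory_blocks) * block_size


if __name__ == "__main__":
    for size in (0, 100, 4000, 100000):
        used = share_disk_usage(size)
        print(f"share data {size:>7} bytes -> about {used:>7} bytes on disk")
```

Under these assumptions even a 100-byte share is charged two full blocks,
which is why the block size dominates the overhead for small files.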

If you're interested, the misc/storage-overhead.py script in the Tahoe source
distribution will tell you how much storage space is consumed for files of
various sizes. The unsurprising summary is that small files have a lot of
overhead associated with them.

What this means for convergence is that relatively small files which are
highly shared can represent a greater potential space savings than their raw
data size suggests, once the per-share overhead and the rounding up to whole
disk blocks are counted. Any analysis which tries to estimate how much of a
win might be represented by convergence needs to take this into account.
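
To make the difference concrete, here is a minimal sketch (again, not from the
Tahoe source) that compares the savings estimated from raw data size alone
with the savings once overhead and block rounding are included. The function
names, the shares_per_file parameter, and the example numbers are
hypothetical; the model also assumes every share of a file is the same size
and charges one directory block per share.

```python
import math


def on_disk_bytes(share_data_bytes, overhead=781, block_size=4096):
    """Estimated disk bytes for one share: whole blocks for the share file,
    plus one assumed block for its storage-index directory."""
    share_blocks = max(1, math.ceil((share_data_bytes + overhead) / block_size))
    return (share_blocks + 1) * block_size


def convergence_savings(share_data_bytes, shares_per_file, copies):
    """Return (raw_savings, on_disk_savings) in bytes.

    With convergence, `copies` identical uploads are stored once, so the
    space saved is the cost of the other (copies - 1) uploads.
    """
    duplicates = copies - 1
    raw = duplicates * shares_per_file * share_data_bytes
    on_disk = duplicates * shares_per_file * on_disk_bytes(share_data_bytes)
    return raw, on_disk


if __name__ == "__main__":
    # Hypothetical example: 1000 users upload the same small file,
    # stored as 10 shares of 100 data bytes each.
    raw, on_disk = convergence_savings(100, shares_per_file=10, copies=1000)
    print(f"savings counting raw data only:  {raw} bytes")
    print(f"savings counting disk overhead:  {on_disk} bytes")
```

For small, widely shared files the second figure is far larger than the
first, which is the effect an analysis based only on raw data size would miss.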

