[tahoe-dev] Perf-related architecture question

Wed Jul 21 18:41:59 UTC 2010

On 7/20/10 11:01 PM, Kyle Markley wrote:

KM> I am running a helper, and see that while the helper is fetching
KM> ciphertext that the storage nodes see essentially no activity. Makes
KM> sense. But what I don't understand is why it takes so long to fetch
KM> the ciphertext.

The client-to-helper traffic isn't pipelined, and the helper asks for
encrypted data from the client in 50kB chunks, so fast lines will see
low utilization. We picked a chunk size based upon likely behavior for
low-end consumer DSL lines (where the ratio of time-to-transmit-chunk to
round-trip-time would be high enough to get reasonable utilization). For
details, see the comment here:

http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/immutable/offloaded.py#L349
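
As a rough illustration of why an unpipelined 50kB fetch suits a slow
line but not a fast one, here is a minimal sketch of the utilization of
a stop-and-wait transfer. The link speeds and round-trip times are
made-up examples, not measurements, and the "effective round trip" is
assumed to include any per-request processing delay:

```python
def stop_and_wait_utilization(chunk_bytes, link_bytes_per_sec, rtt_sec):
    """Fraction of time the link spends sending when each chunk is
    requested only after the previous one has fully arrived."""
    transmit_time = chunk_bytes / link_bytes_per_sec
    return transmit_time / (transmit_time + rtt_sec)


CHUNK_BYTES = 50 * 1000  # the 50kB chunk size mentioned above

# Hypothetical links: (label, upstream bytes/sec, effective round trip in sec)
EXAMPLE_LINKS = [
    ("consumer DSL upstream", 32_000, 0.100),
    ("100Mbps LAN", 12_500_000, 0.020),
]

for label, rate, rtt in EXAMPLE_LINKS:
    util = stop_and_wait_utilization(CHUNK_BYTES, rate, rtt)
    print(f"{label}: {util:.0%} of the link is used")
```

On the slow link the chunk takes far longer to send than the round
trip, so the link stays mostly busy; on the fast link the round trip
dominates and most of the capacity sits idle.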

A side-effect (and arguably a benefit) of low utilization is that
uploads don't mess up your other traffic very badly, which was
convenient for a consumer application that ran for days at a time. Every
couple of seconds, you get a gap in which other applications can get
their data through. Having that sort of "breathing room" let us defer
the development of more intelligent bandwidth-management schemes.

In general, we've been pretty conscious about memory footprint: one of
the big complaints about Tahoe's predecessor (named "Mountain View" or
"MV", derived from the Mnet/MojoNation/HiveCache codebase) was the
hundreds of MB it would regularly consume, so we made a rule that
Tahoe's memory footprint should never grow linearly with filesize. So we
process files in segments, which is frequently at odds with deep
pipelining that would improve performance.

This extends to hard drive space as well as RAM: a Tahoe client doesn't
usually write file contents (or derivatives, like encrypted/encoded
shares) to disk during normal operations. There's a partial-download
cache which breaks this rule, but it's going away with my new
downloader.

And as Zooko points out, the Helper wasn't really designed to speed up
fast upstream pipes. It was specifically built for AllMyData customers
who live on the wrong end of an asymmetric home DSL line, to mitigate
the bandwidth costs of the default 3.3x expansion factor.

On 7/21/10 9:20 AM, Zooko O'Whielacronx wrote:

Z> Disclosure: I never liked the erasure-coding helper. I wanted to
Z> improve the existing upload and repair-and-rebalancing instead of
Z> implementing a second kind of upload.

I'm +0 on slowly killing off the Helper. I haven't been too fond of the
alternatives, though. The best scheme I can think of is to upload just
'k' of your shares and then beg somebody else (with better up/down
bandwidth to the servers) to immediately repair your file for you. This
results in more bandwidth overall (the servers must transmit a full copy
of your file to whoever's doing the repair), and requires more
flexibility out of our Accounting scheme (the repair-generated shares
should count against the original user's quota, not the repair node's
quota). However, this scheme wouldn't take much longer to complete than
the Helper-assisted upload.
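
To make the trade-off concrete, here is a back-of-the-envelope sketch of
who sends how many bytes under each scheme. It assumes the default
3-of-10 encoding, ignores share overhead, and uses labels of my own
choosing ("assistant" means the Helper or the repairer); it is a model
for discussion, not a description of the code:

```python
def upload_costs(file_size, k=3, n=10):
    """Bytes sent by each party for one upload, ignoring overhead."""
    share_size = file_size / k  # each share is roughly 1/k of the file
    return {
        "direct": {
            "client_up": n * share_size,
            "server_out": 0,
            "assistant_up": 0,
        },
        "helper": {
            "client_up": file_size,        # ciphertext sent to the Helper
            "server_out": 0,
            "assistant_up": n * share_size,
        },
        "upload_k_then_repair": {
            "client_up": k * share_size,   # just 'k' shares
            "server_out": k * share_size,  # repairer must fetch the file
            "assistant_up": (n - k) * share_size,
        },
    }


for scheme, costs in upload_costs(3_000_000).items():
    print(scheme, costs)
```

In this model the client's upstream cost is the same for the Helper and
for upload-k-then-repair; the difference is that the servers now have
to send a copy of the file back out to the repairer.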

I fully agree with Zooko about the engineering costs of having both
Helper-based and non-Helper-based uploads, and how much of a drag it would
be to keep using the Helper as we enhance share-placement policies. The
upload-k-then-beg-for-repair scheme would still be more development work
than purely client-driven uploads, but that work would be useful
elsewhere, so it'd be less work overall.

KM> I'm also curious about how the helper distributes shares to the
KM> storage nodes. In my configuration of 4 storage nodes, 3 are wired
KM> at 100Mbps and 1 is wireless. It looks like when the helper is
KM> distributing shares, this happens at roughly the same pace to all
KM> nodes, despite some nodes having faster connections than others. I
KM> would have expected the wired nodes to finish receiving their shares
KM> significantly sooner than the wireless node.

This is a consequence of our memory/disk-footprint policy. The MV
codebase used to encrypt+encode the entire file immediately, writing the
shares to local disk, and only then start uploading those shares to the
servers. This thrashed the disk and used a lot of RAM. So Tahoe streams
the file out: it encrypts+encodes segment[0] (typically 128KiB), uploads
the blocks, waits for those transfers to complete, then forgets about
seg[0] and starts the process on seg[1]. If the time it takes to push
128KiB over the network is not significantly larger than the round-trip
time, your upload pipe will be underutilized.
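
The segment-at-a-time loop described above can be sketched roughly like
this. Everything here is a stand-in: fake_encode() just slices and
repeats bytes instead of doing real encryption and erasure coding, and
each "server" is a plain list rather than a network connection. The
point is only that the data held at any moment is bounded by one
segment plus its blocks, regardless of file size:

```python
import io

SEGMENT_SIZE = 128 * 1024  # 128KiB


def fake_encode(segment, k, n):
    """Stand-in for encrypt+encode: n blocks, each about len(segment)/k.
    NOT real erasure coding, just something with the right sizes."""
    block_size = -(-len(segment) // k)  # ceiling division
    padded = segment.ljust(block_size * k, b"\0")
    pieces = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    return [pieces[i % k] for i in range(n)]


def stream_upload(source, servers, k=3, segment_size=SEGMENT_SIZE):
    """Push a file one segment at a time; return the peak bytes held."""
    peak = 0
    while True:
        segment = source.read(segment_size)
        if not segment:
            break
        blocks = fake_encode(segment, k, len(servers))
        for server, block in zip(servers, blocks):
            server.append(block)  # stand-in for "send and wait for ack"
        peak = max(peak, len(segment) + sum(len(b) for b in blocks))
        # segment and blocks are dropped here, before the next read
    return peak


servers = [[] for _ in range(10)]
peak = stream_upload(io.BytesIO(bytes(1_000_000)), servers)
print("blocks per server:", len(servers[0]), "peak bytes held:", peak)
```

Because every server has to acknowledge its block before the next
segment is read, the slowest server sets the pace for all of them.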

To finish pushing to a fast server earlier, we'd need to buffer the data
that's still waiting to be sent to the slow server, either in RAM (eek!)
or on disk (thrash!). The big problem with using the disk is the access
patterns. You'd encode the file in linear order (seg[0], then seg[1],
then seg[2], etc). If the servers are kept in lock step, then you'll be
reading those encoded shares off disk in the same order. But if some
servers jump ahead of the others, now you're seeking all over the place
as you deliver data from various segments. In MV, this turned into a
worst-case matrix transposition (we wrote to the file in column order,
and the share-pusher read out of the file in row order), and the disk
activity made customers' systems unusable.
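
A toy model of that access pattern, assuming blocks are written to one
file in the order they are encoded (all of seg[0]'s blocks, then all of
seg[1]'s, and so on) and counting a "seek" whenever a read does not
start where the previous one ended. The layout and the numbers are
illustrative, not taken from the MV code:

```python
def count_seeks(read_order, num_servers):
    """read_order is a list of (segment, server) pairs. A block's slot
    on disk is segment * num_servers + server, i.e. encode order."""
    seeks = 0
    prev = None
    for segment, server in read_order:
        pos = segment * num_servers + server
        if prev is not None and pos != prev + 1:
            seeks += 1
        prev = pos
    return seeks


SEGMENTS, SERVERS = 100, 10

# All servers in lock step: read in the same order the blocks were written.
lock_step = [(seg, srv) for seg in range(SEGMENTS) for srv in range(SERVERS)]

# Extreme case: each server drains its whole share before the next starts.
per_server = [(seg, srv) for srv in range(SERVERS) for seg in range(SEGMENTS)]

print("lock step seeks: ", count_seeks(lock_step, SERVERS))
print("per-server seeks:", count_seeks(per_server, SERVERS))
```

Real uploads with mixed server speeds fall somewhere between these two
extremes, but any divergence from lock step turns sequential reads into
seeks.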

If there are only a few slower servers, we could probably afford to
store those shares on disk, but there'd be a funny policy question of
how far to let the fast servers race ahead versus how much storage we're
allowed to use.
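
One way to frame that policy question is as simple arithmetic: given a
disk budget, how many segments can the fast servers get ahead before
the buffered blocks for the slow servers no longer fit? This is a
hypothetical helper, assuming 128KiB segments, k=3, and one buffered
block per slow server per segment of lead:

```python
def max_lead_segments(budget_bytes, slow_servers,
                      segment_size=128 * 1024, k=3):
    """Segments the fast servers may run ahead within the disk budget."""
    block_size = -(-segment_size // k)  # ceiling division
    return budget_bytes // (slow_servers * block_size)


# Hypothetical budget: 100MiB of scratch space
BUDGET = 100 * 1024 * 1024
for slow in (1, 2, 5):
    lead = max_lead_segments(BUDGET, slow)
    print(f"{slow} slow server(s): fast servers may lead by {lead} segments")
```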

Z> Note that there is a pipeline which will store up to 50 KB of
Z> outgoing data per storage server in-memory. In practice if you have
Z> max-seg-size=128 KiB and k=3 (defaults) then you pipeline up to two
Z> outgoing blocks per storage server.

Yeah, I'd forgotten about that. We set the value to match maxsegsize/k,
and each server connection has a separate pipeline. Your throughput
ceiling is always going to be windowsize/RTT, though, just like with raw
TCP.
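
A small sketch of both points. The first function is just the
windowsize/RTT bound. The second assumes (my reading, not checked
against the code) that the pipeline admits another block whenever fewer
than its capacity in bytes are outstanding, which would explain how a
50kB pipeline ends up holding two blocks of roughly 43.7kB each:

```python
import math


def throughput_ceiling(window_bytes, rtt_sec):
    """Upper bound on bytes/sec with at most window_bytes unacknowledged."""
    return window_bytes / rtt_sec


def blocks_in_flight(pipeline_bytes, max_seg_size, k):
    """Assumption: a block is admitted while outstanding < capacity."""
    block_size = math.ceil(max_seg_size / k)
    outstanding = 0
    count = 0
    while outstanding < pipeline_bytes:
        outstanding += block_size
        count += 1
    return count


PIPELINE_BYTES = 50_000
print("blocks in flight:", blocks_in_flight(PIPELINE_BYTES, 128 * 1024, 3))
for rtt_ms in (1, 20, 100):  # hypothetical round-trip times
    ceiling = throughput_ceiling(PIPELINE_BYTES, rtt_ms / 1000)
    print(f"RTT {rtt_ms}ms: at most {ceiling / 1000:.0f} kB/s per server")
```

The ceiling applies per server connection, since each connection has
its own pipeline.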

I haven't yet decided if the new downloader will pipeline block reads or
not, but it should be easier to implement in the new downloader than in
the old. It would bring us back to wanting some better bandwidth-control
tools, though, so that your home DSL line remains useful for other
applications while Tahoe is pushing or pulling hard.

cheers,
  -Brian