[tahoe-dev] Observations on Tahoe performance

Brian Warner warner at lothar.com
Wed Aug 19 20:20:54 UTC 2009


Shawn Willden wrote:
> On Wednesday 12 August 2009 08:02:13 am Francois Deppierraz wrote:
>> Hi Shawn,
>>
>> Shawn Willden wrote:
>>> The first observation is that when uploading small files it is very
>>> hard to keep the upstream pipe full, regardless of my configuration.
>>> My upstream connection is only about 400 kbps, but I can't sustain
>>> any more than about 200 kbps usage regardless of whether I'm using a
>>> local node that goes through a helper, direct to the helper, or a
>>> local node without the helper. Of course, not using the helper
>>> produces the worst effective throughput, because the uploaded data
>>> volume stays about the same, but the data is FEC-expanded.
>> This is probably because Tahoe doesn't use a windowing protocol, which
>> makes it especially sensitive to the latency between nodes.
>>
>> You should have a look at Brian's description in bug #397.
> 
> Very interesting, but I don't think #397 explains what I'm seeing. I
> can pretty well saturate my upstream connection with Tahoe when
> uploading a large file.

First off, I must admit that I'm somewhat embarrassed by Tahoe's
relatively poor throughput+latency performance. Its inability to
saturate a good upstream pipe is probably the biggest reason that I
sometimes hesitate to recommend it as a serious backup solution to my
friends. In the allmydata.com world, we deferred performance analysis
and improvement for two reasons: most consumer upload speeds were really
bad (so we could hide behind that), and we didn't have any
bandwidth-management tools to avoid saturating their upstream and thus
causing problems for other users (so e.g. *not* having a windowing
protocol could actually be considered a feature, since it left some
bandwidth available for the HTTP requests to get out).

That said, I can think of a couple of likely slowdowns.

1: The process of uploading an immutable file involves a *lot* of
roundtrips, which will hurt small files a lot more than large ones. Peer
selection is done serially: we ask server #1, wait for a response, then
ask server #2, wait, etc. This could be done in parallel, to a certain
extent (it requires being somewhat optimistic about the responses you're
going to get, and code to recover when you guess wrong). We then send
share data and hashes in a bunch of separate messages (there is some
overhead per message, but 1.5.0 has code to pipeline them [see #392], so
I don't think this will introduce a significant number of roundtrips).
I think that a small file (1 kB) can be uploaded in peer-selection plus
two RTTs (one for a big batch of write() messages, the second for the
final close() message), but the peer-selection phase will take a
minimum of 10 RTTs (one per server contacted).
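
To put numbers on it: roughly 10 serial peer-selection RTTs plus two
more for write/close is about 12 RTTs, so on (say) a 100ms link even a
tiny file pays over a second of pure latency. Here is a toy sketch of
the serial-vs-parallel difference, using plain stdlib Python rather
than Tahoe's actual Twisted code; query_server() is a made-up stand-in
for the real remote query:

    import time
    from concurrent.futures import ThreadPoolExecutor

    RTT = 0.05  # pretend every server is 50ms away

    def query_server(server, sharenum):
        # stand-in for the real "will you hold share N?" remote call;
        # it just burns one simulated round-trip and says yes
        time.sleep(RTT)
        return True

    def select_peers_serially(assignments):
        # current behavior: one outstanding query at a time, so the
        # whole phase costs roughly len(assignments) * RTT
        return {sh: srv for sh, srv in assignments
                if query_server(srv, sh)}

    def select_peers_in_parallel(assignments):
        # optimistic variant: issue every query at once and wait for
        # the batch, so the phase costs roughly one RTT; real code
        # would also need to recover when a server declines
        with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
            futures = [(sh, srv, pool.submit(query_server, srv, sh))
                       for sh, srv in assignments]
            return {sh: srv for sh, srv, f in futures if f.result()}

    if __name__ == "__main__":
        assignments = [(n, "server-%d" % n) for n in range(10)]
        for fn in (select_peers_serially, select_peers_in_parallel):
            start = time.time()
            fn(assignments)
            print(fn.__name__, "took %.2fs" % (time.time() - start))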

I've been contemplating a rewrite of the uploader code for a while now.
The main goals would be:

 * improve behavior in the repair case, where there are already shares
   present. Currently repair will blindly place multiple shares on the
   same server, and wind up with multiple copies of the same share.
   Instead, once upload notices any evidence of existing shares, it
   should query lots of servers to find all the existing shares, and
   then generate the missing ones and put them on servers that don't
   already have any (see the sketch after this list). (#610, #362, #699)
 * improve tolerance to servers which disappear silently during upload,
   eventually giving up on the server instead of stalling for 10-30
   minutes
 * improve parallelism during peer-selection
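
As a toy sketch of the placement rule in the first item (the dict
shapes and names here are invented for illustration, not the real
repairer API): learn which servers already hold which shares, then
hand the missing shares only to servers that hold none.

    def plan_repair_placement(existing, all_shares, servers):
        # existing: dict mapping server -> set of share numbers it
        # already holds, as reported by a do-you-have-share pass
        held = set()
        for shares in existing.values():
            held |= shares
        missing = [sh for sh in all_shares if sh not in held]
        # only servers with no shares at all are candidates, so we
        # never pile a second copy onto a server that has one
        empty_servers = [s for s in servers if not existing.get(s)]
        return dict(zip(missing, empty_servers))

    plan = plan_repair_placement(
        existing={"srvA": {0, 1}, "srvB": set(), "srvC": {2}},
        all_shares=range(10),
        servers=["srvA", "srvB", "srvC", "srvD", "srvE"],
    )
    # shares 3, 4, 5 land on srvB, srvD, srvE; placing the remaining
    # shares would need more servers (or a different policy)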

2: Foolscap is doing an awful lot of work to serialize and deserialize
the data down at the wire level. This part of the work is roughly
proportional to the number of objects being transmitted (so a list of
100 32-byte hashes is a lot slower than a single 3200-byte string).
http://foolscap.lothar.com/trac/ticket/117 has some notes on how we
might speed up Foolscap. And the time spent in Foolscap gets added to
your round-trip times (we can't parallelize over that time), so it gets
multiplied by the 12ish RTTs per upload. I suspect the receiving side is
slower than the sending side, which means the uploader will spend a lot
of time twiddling its thumbs while the servers think hard about the
bytes they've just received.
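
As a rough illustration of why the object count matters (this is just
a data-shaping example, not the Foolscap API): packing the hashes into
a single string gives a per-object serializer only one object to
handle, and the receiver can slice it back apart cheaply.

    import hashlib

    hashes = [hashlib.sha256(b"block %d" % i).digest()
              for i in range(100)]

    # slow shape for a per-object serializer: 100 separate 32-byte
    # strings
    as_list = hashes

    # cheaper shape: one 3200-byte string
    packed = b"".join(hashes)
    unpacked = [packed[i:i + 32] for i in range(0, len(packed), 32)]
    assert unpacked == hashes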

We're thinking about switching away from Foolscap for share-transfer and
instead using something closer to HTTP (#510). This would be an
opportunity to improve the RTT behavior as well: we don't really need to
wait for an ACK before we send the next block; we just need confirmation
that the whole share was received correctly, and we need to avoid
buffering too much data in the outbound socket buffer. In addition, we
could probably trim off an RTT by changing the semantics of the initial
message, to combine a do-you-have-share query with a
please-prepare-to-upload query. Or, we might decide to give up on
grid-side convergence and stop doing the do-you-have-share query first,
to speed up the must-upload case at the expense of the
might-not-need-to-upload case. This involves a lot of other projects,
though (a sketch of the streaming shape follows the list):

 * change the immutable share-transfer protocol to be less object-ish
   (allocate_buckets, bucket.write) and more send/more/more/done-ish.
 * change the mutable share design to make shares fully self-validating,
   with the storage-index as the mutable share pubkey (or its hash)
 * make the server responsible for validating shares as they arrive:
   replace the "write-enabler" with a rule that the server accepts a
   mutable delta if it results in a new share that validates against the
   pubkey and has a higher seqnum (this will also increase the server
   load a bit, and is the biggest barrier to giving up Foolscap's
   link-level encryption)
 * replace the renew-lease/cancel-lease shared secrets with DSA pubkeys,
   again to tolerate the lack of link-level encryption
 * interact with Accounting: share uploads will need to be signed by a
   who-is-responsible-for-this-space key, and doing this over HTTP will
   be significantly different from doing it over Foolscap.
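
Here's the rough shape of that streaming transfer, as a hypothetical
sketch: send_block(), drain(), and finish_upload() are invented names,
but the point is that the only end-to-end round-trip is the final
whole-share confirmation, with a small window providing local
backpressure so the outbound socket buffer can't grow without bound.

    import hashlib

    WINDOW = 8  # max blocks queued before waiting for the buffer

    def upload_share(blocks, send_block, drain, finish_upload):
        # blocks: iterable of byte strings for one share
        # send_block(data): queue one block on the connection, no ACK
        # drain(): wait until the outbound buffer has room again
        #          (local backpressure, not a server round-trip)
        # finish_upload(digest): the single RTT asking the server to
        #          confirm that what it received hashes to `digest`
        h = hashlib.sha256()
        queued = 0
        for block in blocks:
            send_block(block)
            h.update(block)
            queued += 1
            if queued >= WINDOW:
                drain()
                queued = 0
        return finish_upload(h.digest())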

So this project is waiting for DSA-based lease management, new DSA-based
mutable files, and some Accounting design work. All of these are waiting
on getting ECDSA into pycryptopp.

3: Using a nearby Helper might help and might hurt. You spend a lot more
RTTs by using the helper (it effectively performs peer-selection twice,
plus an extra couple RTTs to manage the client-to-helper negotiation),
but now you're running encryption and FEC in separate processes, so if
you have multiple cores or hyperthreading or whatever, you can utilize
the hardware better. The Helper (which is doing more Foolscap messages
per file than the client) is doing slightly less CPU work, so those
message RTTs might be smaller, which might help.
interesting exercise to graph throughput/latency against filesize using
a local helper (probably on a separate machine but connected via LAN)
and see where the break-even point is. We've got a ticket (#398) about
having the client *not* use the Helper sometimes; maybe a tahoe.cfg
option to set a threshold filesize could be useful.
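
Something like the following on the client side, where the threshold
value is made up and would be set by whatever that graphing exercise
finds as the break-even point:

    HELPER_THRESHOLD = 1000000  # bytes; hypothetical, to be measured

    def choose_upload_path(filesize, helper_available):
        # below the threshold, the helper's extra negotiation RTTs
        # cost more than its CPU offloading buys us, so go direct
        if helper_available and filesize >= HELPER_THRESHOLD:
            return "helper"
        return "direct"

    assert choose_upload_path(10 * 1024, True) == "direct"
    assert choose_upload_path(50 * 1000 * 1000, True) == "helper"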

> For example, if the local node supplies the SID to the helper, the
> helper can do peer selection before the upload is completed.

Yeah, currently the Helper basically tries to locate shares, reports
back the results of that attempt, then forgets everything it's learned
and starts a new upload. We could probably preserve some information
across that boundary to speed things up. Maybe the first query should be
allocate_buckets(): if the file was already in place, we abort the
upload, and if it wasn't, we use the returned bucket references. That
could shave off N*RTT. We should keep this in mind during the Uploader
overhaul; it'll probably influence the design.
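
A hypothetical sketch of that first-query change, assuming an
allocate_buckets-style call that returns both the share numbers the
server already has and write handles for the ones it just reserved
(the exact signature here is invented):

    def start_upload(servers, storage_index, sharenums,
                     allocate_buckets):
        # sharenums: set of share numbers we want to place
        # allocate_buckets(server, storage_index, wanted) is assumed
        # to return (already_have, writers): share numbers the server
        # already holds, plus write handles for the ones it reserved
        found = set()
        writers = {}
        for server in servers:
            already_have, new_writers = allocate_buckets(
                server, storage_index, sharenums - found)
            found |= already_have
            writers.update(new_writers)
            if found >= sharenums:
                return None  # file is already in place: abort upload
        return writers       # otherwise reuse the buckets we just got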

cheers,
 -Brian


