[tahoe-dev] Observations on Tahoe performance

Brian Warner warner at lothar.com
Tue Aug 25 08:59:10 UTC 2009

Shawn Willden wrote:
> Yeah, let TCP handle making sure the whole share arrives, then hash to
> verify. Why the concern about data buffered in the outbound socket?

To keep the memory footprint down. Think of it this way: we could take
an entire 1GB file, encrypt+encode it in a single pass, deliver
everything to socket.write(), and then just sit back and wait for the
ACK. But where will that share data live in the 12 hours it takes to get
everything through your DSL upstream? In RAM.

The actual call that Foolscap makes is a transport.write(), which is
implemented in Twisted by appending the outbound data to a list and
marking the socket as writeable (so that select() or poll() will wake up
the process when that data can be written). The top-most 128KB of
the list is handed to the kernel's socket.write() call, which is allowed
to just accept part of it, leaving the rest in userspace. Typically, the
kernel will have some fixed buffer size that it's willing to let
userspace consume: that space is decreased when socket.write() is
called, and increased when the far end ACKs another TCP segment (this
basically extends TCP's buffering-window into the kernel). I don't know
offhand how large this buffer is, but since every open TCP socket in the
whole system gets one, I suspect it's pretty small, like 64KB or so.
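A toy model may make the two layers of buffering concrete. The class below
is purely illustrative (it is not Twisted's actual implementation, and the
64KB kernel buffer size is just the guess from above): writes are queued
in a userspace list, and only a bounded amount moves into the "kernel" at
a time, with more draining in as ACKs arrive.

```python
from collections import deque

class FakeTransport:
    """Simplified model of a Twisted-style transport: write() never
    blocks, it just queues data in userspace until the kernel's
    socket buffer has room."""
    KERNEL_BUF = 64 * 1024  # assumed per-socket kernel buffer size

    def __init__(self):
        self._pending = deque()  # userspace buffer: grows without bound
        self._in_kernel = 0      # bytes accepted by the kernel, not yet ACKed

    def write(self, data):
        self._pending.append(data)
        self._flush()

    def _flush(self):
        # hand the kernel as much as its buffer will currently take
        while self._pending and self._in_kernel < self.KERNEL_BUF:
            room = self.KERNEL_BUF - self._in_kernel
            chunk = self._pending.popleft()
            accepted, rest = chunk[:room], chunk[room:]
            self._in_kernel += len(accepted)
            if rest:
                self._pending.appendleft(rest)

    def ack(self, nbytes):
        # far end ACKed a TCP segment: kernel buffer space is freed
        self._in_kernel -= nbytes
        self._flush()

    def userspace_backlog(self):
        return sum(len(c) for c in self._pending)
```

Write a whole 1GB share through such a transport and everything beyond
the kernel's ~64KB sits in the userspace deque: exactly the memory spike
described above.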

So the kernel will consume 64KB, and the transport's list (in
userspace/python/Twisted) will consume the rest of the N/k*1GB of
FEC-expanded share data. Badness.

Whereas, if we just put off creating later segments until the earlier
ones have been retired, we don't consume more than a segment's worth of
memory at any one time. We've always had low-memory-footprint as a goal
for Tahoe, especially since the previous codebase which it replaced
could hit multiple hundreds of MB and slam the entire system (memory
footprint was roughly proportional to filesize, whereas in tahoe it's
constant, based upon the 128KiB segment size).
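A minimal sketch of the segment-at-a-time approach (the segment size is
Tahoe's real 128KiB default, but read_segments and send_share here are
hypothetical stand-ins for the real encrypt+encode+upload pipeline):

```python
import io

SEGMENT_SIZE = 128 * 1024  # Tahoe's default segment size (128KiB)

def read_segments(f, segsize=SEGMENT_SIZE):
    """Yield one segment at a time, so memory use is bounded by
    segsize no matter how large the file is."""
    while True:
        seg = f.read(segsize)
        if not seg:
            return
        yield seg

def upload(f, send_share):
    # hypothetical driver loop: encode and deliver each segment, and
    # only then pull the next one from disk
    for segment in read_segments(f):
        send_share(segment)  # stand-in for encrypt+encode+write

# a 300,000-byte "file" comes out as two full segments plus a tail
sizes = [len(s) for s in read_segments(io.BytesIO(b"a" * 300000))]
```

Because the generator is lazy, segment N+1 does not exist in memory until
segment N has been handed off, which is what keeps the footprint constant.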

In 1.5.0, uploading got a bit more clever, and it creates multiple
segments (but not all of them), and writes them all to the kernel, to
try to keep the kernel's pipeline full. It uses a default 50kB pipeline
size: running full-steam ahead until there is more than 50kB of
outstanding (unACKed) data, then stalling the encoding process until
that size drops below 50kB. So in exchange for another spike of 50kB per
connection, we get a bit more pipeline fill. The details depend, of
course, on the RTT and the bandwidth; in some quick tests, I got maybe a
10% speedup on small files over slow links.
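The 50kB pipeline can be sketched as a simple gate (a synchronous toy
model; the real 1.5.0 code is Deferred-based, and these method names are
invented for illustration):

```python
PIPELINE_LIMIT = 50 * 1000  # 50kB, the 1.5.0 default

class Pipeline:
    """Toy model of the pipeline limiter: run full-steam until more
    than PIPELINE_LIMIT bytes are outstanding (unACKed), then stall
    the encoder until the backlog drops back under the limit."""
    def __init__(self, limit=PIPELINE_LIMIT):
        self.limit = limit
        self.outstanding = 0
        self.stalled = False

    def add(self, nbytes):
        # called when another chunk is written to the kernel;
        # returns True if the encoder may keep going, False if it
        # must stall until enough data has been ACKed
        self.outstanding += nbytes
        self.stalled = self.outstanding > self.limit
        return not self.stalled

    def acked(self, nbytes):
        # called when the far end ACKs nbytes; may unstall the encoder
        self.outstanding -= nbytes
        if self.outstanding <= self.limit:
            self.stalled = False
```

The trade is visible in the model: up to an extra 50kB per connection can
be outstanding at once, in exchange for not letting the kernel's send
buffer run dry between segments.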

> I really, really like grid-side convergence.  I'd vote for keeping it and 
> combining the message semantics.

Yeah, me too. It feels like a good+useful place for convergence. Without
it, clients must do considerable work to achieve the same savings
(specifically time savings, by not encoding+uploading things which are
already in the grid).

> 1.  My app -> local node -> helper -> grid
> 2.  My app -> helper (using helper as client) -> grid
> 3.  My app -> local node -> grid
> Option 1 seems to give the best performance. Option 3 obviously sucks
> because it means pushing the FEC-expanded data up my cable modem. It's
> not clear to me why 1 is better than 2. Maybe it's just from spreading
> the CPU load.

Yeah, that'd be my guess.
