[tahoe-dev] Perf-related architecture question

Thu Jul 22 06:38:19 UTC 2010

Brian, 

> A side-effect (and arguably a benefit) of low utilization is that
> uploads don't mess up your other traffic very badly, which was
> convenient for a consumer application that ran for days at a time. Every
> couple of seconds, you get a gap in which other applications can get
> their data through. Having that sort of "breathing room" let us defer
> the development of more intelligent bandwidth-management schemes.

True... but on the other hand, an initial backup takes a long time, and
I'd like for it to be able to finish overnight (because my network can
support that) instead of needing to let it run for several days.

It would be nice to expose more of the tuning parameters as configuration
options.  Some of my machines have oodles of spare memory that I would be
happy for tahoe to use.  The round trip time and bandwidth of a LAN are very
unlike the asymmetric DSL environment assumed by the code.  And some
users have unusually good Internet connectivity; I actually have symmetric
25Mbps.  In short, the defaults are all wrong for me.  :)
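
To put rough numbers on that, here is a back-of-the-envelope sketch.  It
assumes the uploader sends one 128KiB segment and then sits idle for one
round trip before sending the next.  The link speeds and the 50 ms RTT are
illustrative guesses, not measurements, and it ignores encoding time and
the share expansion from erasure coding.

```python
def stop_and_wait_utilization(segment_bytes, uplink_bits_per_sec, rtt_sec):
    """Fraction of the uplink kept busy if each segment waits one round trip."""
    send_time = segment_bytes * 8 / uplink_bits_per_sec
    return send_time / (send_time + rtt_sec)


SEGMENT = 128 * 1024  # bytes per segment, as in the quoted description

# Illustrative link parameters (assumptions, not measurements).
links = {
    "1 Mbps DSL uplink, 50 ms RTT": (1e6, 0.050),
    "25 Mbps symmetric, 50 ms RTT": (25e6, 0.050),
}

for name, (bandwidth, rtt) in links.items():
    utilization = stop_and_wait_utilization(SEGMENT, bandwidth, rtt)
    print(f"{name}: {utilization:.0%} of the uplink in use")
```

Under those assumptions, the faster the link, the larger the share of each
cycle spent waiting rather than sending, which is why a fixed segment size
suits a slow uplink much better than mine.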

> servers. This thrashed the disk and used a lot of RAM. So Tahoe streams
> the file out: it encrypts+encodes segment[0] (typically 128KiB), uploads
> the blocks, waits for those transfers to complete, then forgets about
> seg[0] and starts the process on seg[1]. If the time it takes to push
> 128KiB over the network is not significantly larger than the round-trip
> time, your upload pipe will be underutilized.

This was exciting to read.  The encrypt+encode/transfer ping-pong
guarantees that we will be using either the CPU or the network, but never
both at once, leading to low utilization of both.  I'm very handy with
threading and googled up some information on the Python threading model...
and then I learned about the GIL, which guarantees very low returns to
multithreading.  (And this sort of circumstance is best solved by
multithreading, not multiprocessing.)  I was excited about writing a bit of
code that would use my threading skills while getting me to learn a new
language and contribute to a great project, then had my dreams crushed by
learning that the dominant Python interpreter is thread-hostile... so why
bother?  :(
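
For what it's worth, the overlap I have in mind looks roughly like the
sketch below.  This is not Tahoe code: encode(), pipelined_upload() and the
send callback are hypothetical names, and the SHA-256 call is only a
stand-in for the real encrypt+encode step.  Whether this buys anything under
the GIL depends on how much of the time is spent in C code or blocking I/O
that releases the lock, which I have not measured.

```python
import hashlib
import queue
import threading


def encode(segment: bytes) -> bytes:
    # Stand-in for encrypt+encode (hypothetical): prefix a SHA-256 digest.
    return hashlib.sha256(segment).digest() + segment


def pipelined_upload(segments, send, depth=4):
    """Encode on a worker thread while the calling thread sends."""
    q = queue.Queue(maxsize=depth)  # holds at most `depth` encoded segments
    done = object()                 # sentinel marking the end of the stream

    def producer():
        try:
            for seg in segments:
                q.put(encode(seg))  # blocks while the queue is full
        finally:
            q.put(done)             # always wake the consumer, even on error

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        item = q.get()
        if item is done:
            break
        send(item)                  # e.g. a blocking network write
    worker.join()


if __name__ == "__main__":
    sent = []
    pipelined_upload([b"x" * 1024 for _ in range(8)], sent.append)
    print(len(sent), "segments sent")
```

The queue depth is the knob that trades memory for how far encoding may run
ahead of the network.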

> If there are only a few slower servers, we could probably afford to
> store those shares on disk, but there'd be a funny policy question of
> how far to let the fast servers race ahead versus how much storage we're
> allowed to use.

For my part, I'd be happiest allocating a chunk of memory (not disk) for
tahoe to use for general-purpose buffering.  I hate it when my disk is slow
-- virus scanners are evil :) -- but my computers tend to have more memory
than they really need.
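
A memory allowance like that could be as simple as a byte counter that
makes producers wait once the limit is reached.  The sketch below is only
an illustration of the idea; MemoryBudget and the 64 MiB figure are made up
for the example and do not correspond to any existing tahoe option.

```python
import threading


class MemoryBudget:
    """Blocks callers once `limit` bytes are outstanding (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._cond = threading.Condition()

    def acquire(self, nbytes):
        if nbytes > self.limit:
            raise ValueError("request exceeds the whole budget")
        with self._cond:
            # Wait until the request fits under the limit.
            self._cond.wait_for(lambda: self.used + nbytes <= self.limit)
            self.used += nbytes

    def release(self, nbytes):
        with self._cond:
            self.used -= nbytes
            self._cond.notify_all()  # let waiting producers re-check


# Example: a 64 MiB allowance, reserved before buffering a block
# and returned once the block has been sent.
budget = MemoryBudget(64 * 1024 * 1024)
budget.acquire(128 * 1024)
budget.release(128 * 1024)
```

Fast servers could then race ahead only as far as the budget allows,
without touching the disk.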

-- 
Kyle Markley