[tahoe-dev] Perf-related architecture question

Brian Warner warner at lothar.com
Thu Jul 22 18:07:10 UTC 2010


On 7/21/10 11:47 PM, Zooko O'Whielacronx wrote:
> On Thu, Jul 22, 2010 at 12:38 AM, Kyle Markley <kyle at arbyte.us> wrote:
>>
KM>> This was exciting to read. The encrypt+encode/transfer ping-pong
KM>> guarantees that we will either be using CPU, or network, but not
KM>> both simultaneously, leading to low utilization of both.

KM>> .. then I learned about the GIL, which guarantees very low returns
KM>> to multithreading. (And this sort of circumstance is best solved by
KM>> multithreading, not multiprocessing.)

Z> This analysis is wrong. Tahoe-LAFS v1.7.1 has low utilization of
Z> network bandwidth, but this has nothing to do with multithreading or
Z> the Python GIL and everything to do with states where the client
Z> waits to hear back from the server before it takes the next step. In
Z> other words, it is all about lack of pipelining in the
Z> upload/download protocols.

Hm, I think you're both partially right. The ping-pong effect occurs
when the pipelining isn't deep enough to keep the CPU busy while the
network buffers finish draining. This is much more likely when you've
got fast upload pipes like Kyle's network. A simple way to mitigate it
is to upload multiple files at the same time: you're filling the network
"pong" phases of one upload with CPU "ping" phases of a different
upload. Another easy experiment is to increase the pipeline depth: edit
the "pipeline_size=50000" default in src/allmydata/immutable/layout.py at
line 97, in the WriteBucketProxy.__init__ arguments, and set it to
something bigger. The ideal value is your bandwidth*delay product divided
by N (the total number of shares), then divided by the number of
simultaneous uploads you plan to do.

The point at which threads/processes might help is when your network is
faster than a single CPU core, and you want to take advantage of
multiple cores. If you could run the CPU-intensive well-confined parts
of the upload process in a separate thread, then the main thread could
continue to serve the network (implementing Foolscap, SSL, etc) while a
different core works on AES/ZFEC. The GIL protects access to most Python
objects (to maintain the integrity of lists, dictionaries, objects,
etc). But CPU-intensive things like AES or ZFEC could mostly work on raw
data without touching Python objects. The AES code could malloc a new
output buffer, encrypt the data into that buffer, then finally create a
Python string object around the buffer: all but the last step could be
performed without holding the GIL, and thus could run at full-speed on a
separate core.
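To sketch what I mean in Twisted terms (this is only an illustration, not
how the current upload code is organized; encrypt_and_encode_segment is a
stand-in for a C-extension call that drops the GIL while it works on raw
buffers):

    from twisted.internet.threads import deferToThread

    def encrypt_and_encode_segment(plaintext):
        # stand-in: real code would call into the AES/zfec extensions
        # here, which could release the GIL while crunching raw bytes
        return plaintext

    def send_shares_to_servers(shares):
        pass  # hand the encoded shares back to the network layer

    def process_segment(plaintext):
        # the CPU-heavy step runs in the reactor's thread pool (and on
        # another core, if the extension releases the GIL), while the
        # reactor thread keeps servicing Foolscap/SSL traffic
        d = deferToThread(encrypt_and_encode_segment, plaintext)
        d.addCallback(send_shares_to_servers)  # fires back on the reactor thread
        return d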

Python is not particularly thread-friendly, but I've never missed it,
probably because I'm not very thread-friendly myself :-). The entire
point of Twisted is to use an event-driven reactor model instead of a
shared-state-concurrency locking model, and in my experience it causes
far less trouble and is "fast enough" for most purposes. Thanks to
Twisted, I've only had to debug maybe one simultaneous-update
interleaving bug in the last 5 years, whereas in the threading world,
those problems show up all the time.

So when needs like this come up, I look for non-thread-based solutions.
Multiprocessing is one: both the Helper and the infrequently-used Key
Generator help here by spreading some work over multiple processes,
using explicit messages between them and avoiding the pitfalls of
shared-state concurrency. Pipelining is another. Higher-level
parallelism (i.e. uploading two files at once) is another easy approach.

Z> At least, that's my assumption. I don't think anybody has yet
Z> measured carefully enough to prove the actual causes of the low
Z> network utilization. Maybe you could help with that! (See e.g. #809,
Z> but you might have better ideas for how to figure this out.)

From a theoretical point of view, for uploading one file at a time with
no pipelining, I expect us to let the upstream wire sit idle for 1*RTT
per segment (waiting for the ACKs), and each segment to take about
segsize*(N/k)/bandwidth to transmit, from which we should be able to
calculate an expected throughput and compare it against reality. With
the pipeline we'll actually keep up to N*pipeline_size extra bytes in
flight, so the idle time should be something like
1*RTT - (N*pipeline_size/bandwidth). Varying
segment size (ticket #809) would be an interesting graph to produce, as
would adjusting the pipeline depth in WriteBucketProxy. It would be nice
to see how the theoretical formula compares to reality.
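Here's roughly how I'd compute the expected number (my own quick model of
the formulas above, clamping the idle time at zero once the pipeline
covers the RTT):

    def expected_upload_throughput(bandwidth, rtt, segsize, N, k,
                                   pipeline_size):
        """bandwidth in bytes/sec, rtt in seconds, segsize in bytes;
        returns expected plaintext bytes/sec for a single upload."""
        send_time = (segsize * N / float(k)) / bandwidth  # wire time per segment
        idle_time = max(0.0, rtt - (N * pipeline_size) / float(bandwidth))
        return segsize / (send_time + idle_time)

    # e.g. 1.25MB/s up, 80ms RTT, 128KiB segments, 3-of-10 encoding,
    # default pipeline_size=50000: the pipeline already covers the RTT,
    # so throughput comes out near bandwidth*k/N (~375kB/s of plaintext)
    print(expected_upload_throughput(1.25e6, 0.080, 128*1024, 10, 3, 50000))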

There will be a point above which increasing the pipeline depth doesn't
improve throughput. At this point, either your network or your CPU
should be saturated. If it's the CPU, then performance could be improved
further by either using multiple cores or improving the efficiency of
the upload code (reducing the CPU cycles per byte, maybe by moving from
Foolscap to plain HTTP [ticket #510]).

cheers,
  -Brian


