[tahoe-dev] segment size, RTT and throughput

Brian Warner warner at lothar.com
Wed Mar 28 23:27:42 UTC 2012

On 3/28/12 3:37 PM, Vladimir Arseniev wrote:

> Is it still true in Tahoe-LAFS 1.9.1 that throughput for immutable
> uploads = segment size / RTT?

Close, but you need to include the expansion factor there, so
segsize*N/(k*RTT). There's a little bit of pipelining (50kB per
connection), mostly intended to allow the uploader to do a bunch of
small writes (32-64 bytes) without incurring additional RTT stalls.
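
To put rough numbers on that, here's a minimal sketch of the
RTT-limited formula. The 3-of-10 encoding and the 100ms RTT are
assumptions picked for illustration, and the function name is made up
(it isn't anything in the Tahoe source):

```python
def rtt_bound_wire_rate(segsize, k, n, rtt):
    """Bytes/second sent across all servers when each segment costs one RTT.

    segsize*N/k is the expanded size of one segment (all N blocks),
    and one such segment goes out per round trip.
    """
    return segsize * n / (k * rtt)


SEGSIZE = 128 * 1024   # default max segment size, in bytes
K, N = 3, 10           # assumed 3-of-10 encoding
RTT = 0.1              # assumed round-trip time, in seconds

rate = rtt_bound_wire_rate(SEGSIZE, K, N, RTT)
plain = SEGSIZE / RTT  # the same bound without the expansion factor

print(f"wire rate:      {rate / 1000:.0f} kB/s")
print(f"without N/k:    {plain / 1000:.0f} kB/s")
```

The second number is the plain segsize/RTT figure from the question;
the first includes the N/k expansion.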

The pipeline size is set in src/allmydata/immutable/layout.py line 103,
and the comment around line 120 says:

        # k=3, max_segment_size=128KiB gives us a typical segment of
        # 43691 bytes. Setting the default pipeline_size to 50KB lets us
        # get two segments onto the wire but not a third, which would
        # keep the pipe filled.
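
The arithmetic in that comment can be checked with a short sketch. Two
things here are my assumptions rather than anything quoted above: that
"50KB" means 50000 bytes, and that the pipeline admits another write
whenever the bytes currently outstanding are below the limit. The
function names are hypothetical.

```python
import math


def block_size(segment_size, k):
    """Size of one block: the segment split k ways, rounded up."""
    return math.ceil(segment_size / k)


def blocks_admitted(block, pipeline_size):
    """Count the writes accepted before the pipeline stalls.

    ASSUMPTION: a write is admitted whenever the outstanding byte
    count is still below pipeline_size, and no ACKs have arrived yet.
    """
    if block <= 0:
        raise ValueError("block size must be positive")
    outstanding = 0
    count = 0
    while outstanding < pipeline_size:
        outstanding += block
        count += 1
    return count


block = block_size(128 * 1024, 3)
in_flight = blocks_admitted(block, 50_000)  # assumed value for "50KB"
print(block, in_flight)
```

Under those assumptions the second block is admitted because only one
block's worth is outstanding, and the third has to wait for an ACK.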

> We're puzzled, and would appreciate comment.

> The client (Ubuntu VM with one CPU, 512MB memory and 1.3GB swap) does
> not deal well with more than five simultaneous uploads. Although all
> shares of all files do get uploaded, some of the files don't get
> linked to the grid directory, and their upload operations are missing
> from "Recent Uploads and Downloads".

Hrm, that's a worrisome bug. Could you file a report on it, maybe with a
shell script to reproduce the parallel uploads?

> We've also started to look at the impact of increasing segment size
> (client.py) and pipeline (layout.py), and we're puzzled. For 10MB
> files, increasing segment size to 1MB and pipeline to 40MB doesn't
> accelerate helper uploads (60KBps) but it does accelerate pushing to
> storage nodes (250KBps). Why might that be?

The helper-to-server push (or client-to-server push, when you aren't
using a helper) does one segment at a time, plus the small pipelining
described above. That should do better than one block per RTT, but not
as well as increasing the segsize would. To compute the actual
utilization: say "TS" is the total speed, so TS/N is the bandwidth
available per server. Each block is SS/k (for the default SS=128KiB
segsize, you get 44kB blocks), and the pipeline lets you have up to 50kB
in flight at any given moment, so you'd get two blocks on the wire, then
wait 44kB/(TS/N)+1*RTT before you get the ACK for the first one, then
you release the third block, then 44kB/(TS/N)+1*RTT later you get the
second ACK and release the fourth, etc. So in each slot you get SS/k
bytes to the server, and it takes (SS/k)/(TS/N)+1*RTT seconds.

By my math that reduces to a total speed of TS/(N+k*RTT*TS/SS). So if
your RTT is zero then you get TS/N (which is right), if it's big then N
is insignificant and the TS cancels and it degenerates to SS/(k*RTT),
which makes sense (you've got enough parallelism to ignore N, but too
much RTT for the pipeline to help, so you're limited by the block size).
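
Here's a small sketch that checks the algebra: it computes the rate
directly from the slot description (SS/k bytes per slot, each slot
taking the transfer time plus one RTT) and compares it with the
closed form, then looks at the two limits. The link speed, RTT values
and 3-of-10 encoding are assumed for illustration. Note that with
these definitions the result reads as the rate at which each server
receives block data, which is why the zero-RTT limit is TS/N.

```python
import math


def slot_model_rate(ts, n, k, ss, rtt):
    """Bytes/second reaching one server, straight from the slot model."""
    block = ss / k                      # bytes delivered per slot
    slot_time = block / (ts / n) + rtt  # transfer time plus one RTT
    return block / slot_time


def closed_form_rate(ts, n, k, ss, rtt):
    """The reduced expression TS/(N + k*RTT*TS/SS)."""
    return ts / (n + k * rtt * ts / ss)


TS = 1_000_000    # assumed total uplink speed, bytes/second
N, K = 10, 3      # assumed 3-of-10 encoding
SS = 128 * 1024   # default segment size, bytes

# The two forms agree for an ordinary RTT.
assert math.isclose(slot_model_rate(TS, N, K, SS, 0.1),
                    closed_form_rate(TS, N, K, SS, 0.1))

# RTT = 0: each server just gets its share of the link, TS/N.
zero_rtt = closed_form_rate(TS, N, K, SS, 0.0)

# Very large RTT: N stops mattering and the rate approaches SS/(k*RTT).
big_rtt = closed_form_rate(TS, N, K, SS, 100.0)

print(zero_rtt, TS / N)
print(big_rtt, SS / (K * 100.0))
```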

The client-to-Helper connection uses a fixed chunk size of 50kB. The
relevant comment in src/allmydata/immutable/offloaded.py (line 359) is:

    # read data in 50kB chunks. We should choose a more considered
    # number here, possibly letting the client specify it. The goal
    # should be to keep the RTT*bandwidth to be less than 10% of the
    # chunk size, to reduce the upload bandwidth lost because this
    # protocol is non-windowing. Too large, however, means more memory
    # consumption for both ends. Something that can be transferred in,
    # say, 10 seconds sounds about right. On my home DSL line (50kBps
    # upstream), that suggests 500kB. Most lines are slower, maybe
    # 10kBps, which suggests 100kB, and that's a bit more memory than I
    # want to hang on to, so I'm going to go with 50kB and see how that
    # works.

    CHUNK_SIZE = 50*1024

If you have a fast-but-high-latency client-to-Helper connection, that's
probably limiting: you might try changing it (on the helper, not the
client: the protocol is "suck", not "blow") and see how it affects the
throughput. That'd be an excellent good-first-patch, to add a tahoe.cfg
setting (maybe "helper.chunksize=") to override that value. It'd also be
reasonable to set it equal to the segsize, since the memory-footprint
that results from the helper's chunk size is comparable to the footprint
produced by the segsize.
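
To see how much a fixed 50kB chunk can cost on a fast-but-high-latency
link, here's a rough sketch. It assumes a simple stop-and-wait model
(one chunk requested at a time, the next request sent only after the
previous chunk has arrived), which is my reading of "non-windowing" in
the comment above. The 1MB/s link and 200ms RTT are made-up numbers and
the function names are hypothetical.

```python
def stop_and_wait_rate(chunk_size, bandwidth, rtt):
    """Bytes/second when each chunk costs its transfer time plus one RTT."""
    return chunk_size / (chunk_size / bandwidth + rtt)


def chunk_for_overhead(bandwidth, rtt, max_fraction=0.10):
    """Smallest chunk for which RTT*bandwidth is max_fraction of the chunk.

    This is the "keep RTT*bandwidth below 10% of the chunk size" goal
    from the source comment, solved for the chunk size.
    """
    return rtt * bandwidth / max_fraction


BANDWIDTH = 1_000_000   # assumed link speed, bytes/second
RTT = 0.2               # assumed round-trip time, seconds
CHUNK_SIZE = 50 * 1024  # the current fixed value

current = stop_and_wait_rate(CHUNK_SIZE, BANDWIDTH, RTT)
wanted_chunk = chunk_for_overhead(BANDWIDTH, RTT)
improved = stop_and_wait_rate(wanted_chunk, BANDWIDTH, RTT)

print(f"50kB chunks:  {current / BANDWIDTH:.0%} of the link")
print(f"chunk needed: {wanted_chunk / 1000:.0f} kB")
print(f"with that:    {improved / BANDWIDTH:.0%} of the link")
```

On a link like that, most of each cycle is spent waiting on the round
trip rather than moving data, so a larger chunk should help until the
memory footprint becomes the concern.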

