[tahoe-dev] peter's upload problems + analysis

Brian Warner warner-tahoe at allmydata.com
Tue Mar 11 08:07:21 UTC 2008


Peter asked me to take a look at a problem he had uploading a large
file to our prodnet last night.

The file was 102.4MB in size, and was uploaded using the Helper
starting at 22:40:23.591 last night, storage index was
db4tqp6j5xbzn5x4gfq3uwdvry. The Helper did a quick check of the grid
and found that the file was not already present, so it told the client
to prepare for ciphertext fetch, which began at 22:40:24.411 . The
client began providing ciphertext at a rate of 23.4kBps (about 190kbps,
decent upstream bandwidth for a home DSL line). It continued to do so
for 72 minutes, with the last fetch request sent towards the client
(asking for offset 101324800, so about 1MB left to go) at
23:52:30.039 . This request terminated with a failure 210 seconds
later, at 23:56:00.680, with a ConnectionDone error, as the original
connection was replaced by a new one (the
duplicate-connection-suppression logic inside foolscap). This failure
caused the overall upload to fail, but of course with a broken
connection it could not report this failure to anybody.

Peter restarted the upload manually at about 00:13:42, whereupon it
picked up where it left off, and finished the remaining ciphertext
fetch in about 30 seconds and completed the encode+push in 123 seconds
(at a rate of 833kBps, which is much lower than the 2 or 3 MBps I was
expecting).

So the interesting question is why the helper witnessed three and a half
minutes of silence, followed by a replaced connection. Peter's tubid
(xbtc) is higher than the helper's (v5dw), and since I assume Peter's
home machine is behind a NAT box, that means his machine is both the
connection initiator and the foolscap master: it gets to decide which
connection is new and which one is old.

This implies that it was Peter's home machine that decided the
connection was no longer viable and initiated a new one. Foolscap will
do this when the kernel closes the TCP connection, generally because
outbound traffic has been unacknowledged for 5 to 15 minutes. It might
also be possible that a NAT box decided to drop the binding after 72
minutes of unidirectional traffic. Or perhaps his ISP doesn't like him.

Peter is going to grab his machine's logs and we'll look at them in the
morning.

cheers,
 -Brian



More information about the tahoe-dev mailing list