[tahoe-dev] connection timeouts on VG2

Greg Troxel gdt at ir.bbn.com
Tue Mar 27 00:34:16 UTC 2012


Brian Warner <warner at lothar.com> writes:

> On 3/26/12 4:54 AM, Greg Troxel wrote:
>
>>   If the keepalive is set to 120s, then it seems simply broken (and
>>   likely easy to fix) to have the keepalive happen at 240s. 120s as a
>>   keepalive timer should mean that a keepalive is sent 120s after the
>>   last recorded transmission.
>
> Yeah, I have to admit it's kind of weird. It's because I designed that
> code to minimize all forms of overhead:

> [snip]

The 10s timer makes sense to me; minor cpu and no disk seems ok.  If
timers are actually expensive, this seems like a really good plan.  10s
is adequate fuzz, and 30s is probably ok.  One thing that would help is
to make sure keepalives fire on time, not late - asking for 270s and
getting 240s is ok, but 301s is not.

Another idea: when there's traffic, note the time.  If the keepalive
timer is not running, start one for now + .keepaliveTimeout.  If it is
running, leave it be.  When the timer expires, if now - last >=
.keepaliveTimeout, send a keepalive; otherwise set one for last +
.keepaliveTimeout.  That means there is only one timer firing per
.keepaliveTimeout, more or less, whether the connection is active or
idle, and you actually get the right answer.  Sort of like resetting
the timer, but with a lazy reset.
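
Roughly what I mean, as an untested Twisted-style sketch (the names
are made up, not Foolscap's; note it reschedules for the remaining
interval, so it fires on time or a bit early, never late):

    import time
    from twisted.internet import reactor

    class LazyKeepalive:
        def __init__(self, send_keepalive, timeout=120.0):
            self.send_keepalive = send_keepalive
            self.timeout = timeout      # .keepaliveTimeout
            self.last = None            # time of last transmission
            self.timer = None

        def note_traffic(self):
            # called on every transmission: just record the time, and
            # start a timer only if none is already running
            self.last = time.time()
            if self.timer is None:
                self.timer = reactor.callLater(self.timeout, self._expired)

        def _expired(self):
            self.timer = None
            idle = time.time() - self.last
            if idle >= self.timeout:
                self.send_keepalive()
                self.last = time.time()  # the keepalive is traffic too
                self.timer = reactor.callLater(self.timeout, self._expired)
            else:
                # lazy reset: sleep only for the remaining interval
                self.timer = reactor.callLater(self.timeout - idle,
                                               self._expired)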

> I went with option 1 back in 2007 because I thought timers were
> expensive (in particular Twisted has to re-sort the timer list each time
> you change it). But I've never tested it to see how much overhead it
> really adds.. Tahoe doesn't use too many timers, so re-sorting the list
> for each arrival might not be such a big deal.

I see.  It seems like hard-to-follow behavior is not worth it unless
it saves a tremendous amount.
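
As a ballpark, the heap bookkeeping alone is easy to time; this uses
plain heapq as a stand-in for Twisted's timed-call list, so it only
approximates the cost in question:

    import heapq
    import random
    import timeit

    n_timers = 100   # assumption: ~100 live timers on a busy node
    heap = [random.uniform(0, 300) for _ in range(n_timers)]
    heapq.heapify(heap)

    def reschedule():
        # pop one deadline and push a new one: the ordering work that
        # a per-arrival keepalive reset would force on the timer list
        heapq.heappop(heap)
        heapq.heappush(heap, random.uniform(0, 300))

    n = 100000
    secs = timeit.timeit(reschedule, number=n)
    print("%.3f us per reschedule" % (secs / n * 1e6))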

> Yeah, unless the response is stuck behind a couple MB of data coming
> back from the other 15 servers we're currently talking to, or the
> video-chat session with which your roommate is hogging the downlink :).

Even when I had a 28.8 modem I never got more than about 5s of
blocking delay.

>>   It occurs to me that opening a new connection, retrying the RPC on
>>   it, and closing the old one immediately is not so bad, since at
>>   100x RTT surely the old one is messed up somehow.
>
> Right. The thing that I've never been able to work out is how to set
> that threshold. It showed up in the new immutable downloader: I've got
> an event type ("OVERDUE") and code that reacts to it by preferring
> alternate servers, but no code to emit that event, since I don't know
> what a good threshold ought to be. I know it needs to be adaptive, not
> hard-wired.
>
> The issue is muddied by all the layers of the TCP stack. If the problem
> is down at the NAT level (router, wireless, etc), then 10x or 100x the
> TCP RTT time would probably be safe: if we see no TCP ACK in that time,
> conclude that the connection is lost. But how can we measure that? (both
> the RTT time, and the elapsed time-since-ACK).
>
> The closest we can get (with a single connection) to measuring the RTT
> is the response time to a short Foolscap RPC message, but that's going
> to be delayed by head-of-line blocking (in both directions) and other
> traffic. Maybe the shortest-recorded RPC-level RTT we have for the last
> hour, and hope that this will include at least one uncontested call?
> It'd be nice if sockets had a "report your low-level RTT time" method.
> As well as a "report how long you've had unACKed data" method: we can't
> measure that from userspace either. The closest we can get is
> .dataLastReceivedAt (which is why I went for the syscall-heavy
> approach-1 above, so this information would be available to
> applications). This isn't bad, but is still quantized by the TCP rx
> buffers.

Agreed, this is hard.  Absent support for querying the transport
protocol, recording the minimum RTT seen at the RPC level over some
recent window seems like a good guess.
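
Something like the following sketch; the one-hour window and the 10x
multiplier are placeholders, not settled numbers.  (For what it's
worth, on Linux getsockopt(TCP_INFO) does expose the kernel's RTT
estimate, but it isn't portable.)

    import time
    from collections import deque

    class RpcRttTracker:
        def __init__(self, window=3600.0):
            self.window = window    # remember the last hour of samples
            self.samples = deque()  # (timestamp, rtt), oldest first

        def record(self, rtt):
            now = time.time()
            self.samples.append((now, rtt))
            self._prune(now)

        def min_rtt(self):
            self._prune(time.time())
            if not self.samples:
                return None
            return min(rtt for (_, rtt) in self.samples)

        def _prune(self, now):
            while self.samples and self.samples[0][0] < now - self.window:
                self.samples.popleft()

    # e.g. declare a request OVERDUE after 10 * tracker.min_rtt()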

> Overall, the issue is one of resource optimization tradeoffs. To
> minimize network traffic, never ever send a duplicate packet: just wait
> patiently until TCP gives up. Or, to react quickly to lost connections
> and speed downloads, send lots of duplicate packets, even if the extra
> traffic slows down other users or you accidentally give up on a
> live-but-slow connection.

Agreed.  TCP gets this more or less right, except for the buggy-NAT
situation.

>>   So on balance I favor either no keepalive or a several hour default
>>   keepalive, switching to 270s for any peer that has a stall, with the
>>   270s being sticky for a week.  Optionally a node can further put 270s
>>   on all peers, if at least half the peers have had stalls within a few
>>   hours.
>
> Nice!

I picked that because it didn't seem that complicated, and thus seemed
to have lower risk of unintended consequences.
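
As a sketch (270s and the one-week stickiness are from the proposal;
the names and the details of the half-the-peers check are made up):

    import time

    WEEK = 7 * 24 * 3600
    FAST_KEEPALIVE = 270    # seconds, for peers that have stalled
    SLOW_KEEPALIVE = None   # default: none (or several hours)

    class KeepalivePolicy:
        def __init__(self):
            self.stalls = {}  # peerid -> time of most recent stall

        def note_stall(self, peerid):
            self.stalls[peerid] = time.time()

        def keepalive_for(self, peerid):
            last = self.stalls.get(peerid)
            if last is not None and time.time() - last < WEEK:
                return FAST_KEEPALIVE  # sticky for a week after a stall
            return SLOW_KEEPALIVE

        def most_peers_stalling(self, peerids, window=6 * 3600):
            # optional escalation: if at least half the peers have
            # stalled within a few hours, put 270s on all of them
            cutoff = time.time() - window
            recent = [p for p in peerids
                      if self.stalls.get(p, 0) >= cutoff]
            return bool(peerids) and 2 * len(recent) >= len(peerids)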

>>   It's interesting to consider not keeping connections open, but
>>   instead opening them on demand and closing after say 1m of idle
>>   time. There's a slight latency hit, but it should avoid a lot of
>>   issues and result in a lot fewer standing connections.
>
> Yeah, I suspect that when we get to tahoe-over-HTTP, it'll end up
> working that way. Sort of lowest-common-denominator, but that's
> practicality-over-purity for you. For data-transfer purposes, you really
> want to leave that connection nailed up (otherwise you lose a lot of
> bandwidth waiting for TCP to ratchet up the window size), but of course
> that gets you right back in the same boat.

I would be interested in seeing data from some real benchmarks.  I
think TCP resets cwnd after long idle periods anyway (slow-start
restart, RFC 2581 section 4.1).

>>   This sort of implies that only the forward direction is used for RPC
>>   origination, but I think that's a good thing, because backwards RPCs
>>   hide connectivity problems - while things work better on average
>>   they are harder to debug.
>
> I've benefitted from the backwards-RPC thing: one server living behind
> NAT was able to establish outbound connections to a public-IP client
> (which pretended to be a server with no space available), allowing that
> client to use the server normally. But I agree that it's surprising and
> confusing, so I'd be willing to give it up. Especially if we could make
> some progress on Relay or STUN or UPnP or something.

I have too.  But what I'm trying to argue is that a private-IP server
with a public-IP client is conceptually broken: it leads to a grid
where a client can reach the server while the client has a public IP,
and then cannot once it moves to a private IP, and that causes more
trouble than it's worth.  Thus as a matter of policy I think we should
not support this, and servers on private IP addresses should have to
set up NAT port forwarding or fail.

>>   How's IPv6 coming?
>
> Still waiting on Twisted, but I heard them making great progress at the
> recent PyCon sprints, so I think things are looking good. I don't think
> Foolscap has too much to do once Twisted handles it (code to call
> connectTCP6 or however they spell it, maybe some changes to
> address-autodetection), ditto for Tahoe (some web renderers probably
> need updating).

Glad to hear it.  It's still amazing to me that Twisted doesn't have
IPv6 support yet; it seems like 10+ years overdue.