[tahoe-dev] [tahoe-lafs] #816: don't rely on notifyOnDisconnect()

tahoe-lafs trac at allmydata.org
Wed Oct 21 22:19:57 UTC 2009

 Reporter:  zooko         |           Owner:           
     Type:  enhancement   |          Status:  new      
 Priority:  minor         |       Milestone:  undecided
Component:  code-network  |         Version:  1.5.0    
 Keywords:                |   Launchpad_bug:           
 #653 was a long, drawn-out investigation that concluded that there is
 probably (but not certainly) a bug in foolscap in which
 {{{notifyOnDisconnect()}}} sometimes doesn't get triggered when it is
 supposed to.  Fixing {{{notifyOnDisconnect()}}} (and writing automated
 tests for it) is quite tricky.  Also, it can never be 100% correct,
 because of the inherent unreliability of communications, the limits
 imposed by the speed of light, and so on.  My personal prejudice, as
 someone who has long studied secure and fault-tolerant networked
 applications, is that you should really avoid relying on such a service
 (one that attempts to tell you when a remote object has switched from
 "likely to respond in a timely way to your next request" to "unlikely to
 respond in a timely way to your next request").  Instead, design your
 system so that it works correctly, and as efficiently as it can,
 regardless of the pattern of connections and disconnections of the
 underlying comms subsystems.  (Hm, I guess this is an instance of the
 general idiom "Don't check whether it is likely to work, then try, then
 handle failure; instead, just try and then handle failure.")

 Now, Tahoe-LAFS already does it this way!  For the most part.  There are a
 few places where we invoke {{{notifyOnDisconnect()}}}, but removing most
 of them would not diminish the functionality of Tahoe-LAFS.  What
 ''would'' diminish its functionality is what Brian described on #653:

  * the welcome-page status display would be unable to show "Connected /
 Not Connected" status for each known server. Instead, it could say "Last
 Connection Established At / Not Connected". Basically we'd know when the
 connection was established, and (with extra code) we could know when we
 last successfully used the connection. And when we tried to use the
 connection and found it down, we could mark the connection as down until
 we'd re-established it. But we wouldn't notice the actual event of
 connection loss (or the resulting period of not-being-connected) until we
 actually tried to use it. So we couldn't claim to be "connected", we could
 merely claim that we *had* connected at some point, and that we haven't
 noticed becoming disconnected yet (but aren't trying very hard to notice).
  * the share-allocation algorithm wouldn't learn about disconnected
 servers until it tried to send a message to them (this would fail quickly,
 but still not synchronously), yet it allocates share numbers ahead of time
 for each batch of requests. This could wind up with shares placed
 0,1,3,4,2 instead of 0,1,2,3,4.

 The first problem would be annoying, so I think we're going to leave tahoe
 alone for now. I'll add a note to the foolscap docs to warn users about
 the notifyOnDisconnect bug, and encourage people not to rely upon it in
 environments where replacement connections are likely.
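A minimal sketch of the bookkeeping Brian describes in the first bullet, assuming a hypothetical ServerStatus helper (not Tahoe-LAFS code) that is told when a connection is established and when a call on it succeeds or fails. Without disconnect notifications it can only ever report "we had connected, and haven't noticed a disconnect yet", never "connected".

```python
import time


class ServerStatus:
    """Hypothetical helper: what we can honestly claim about a server
    without notifyOnDisconnect()."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.connected_at = None     # when the connection was last established
        self.last_success_at = None  # last time a remote call on it succeeded
        self.known_down = False      # set only when we *observe* a failed call

    def on_connect(self):
        self.connected_at = self.clock()
        self.known_down = False      # re-established: clear the "down" mark

    def on_call_succeeded(self):
        self.last_success_at = self.clock()

    def on_call_failed(self):
        # We only learn the connection is gone when we try to use it.
        self.known_down = True

    def describe(self):
        if self.connected_at is None or self.known_down:
            return "Not Connected"
        return f"Last Connection Established At {self.connected_at:.0f}"


# Usage with a fake clock so the timestamps are predictable.
ticks = iter([100.0, 105.0])
s = ServerStatus(clock=lambda: next(ticks))
print(s.describe())   # Not Connected
s.on_connect()        # connected_at = 100.0
s.on_call_succeeded() # last_success_at = 105.0
print(s.describe())   # Last Connection Established At 100
s.on_call_failed()
print(s.describe())   # Not Connected
```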


 Since he wrote that, I realized that it would be cool if the welcome page
 had a "ping all servers" button, which would then change each server's
 status to indicate whether it responded to the ping (and how long it took).
 This would, in my opinion, be more reliable and more informative than the
 current "connected/not-connected" welcome page.
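One possible shape for the "ping all servers" feature, sketched with asyncio rather than Twisted/foolscap so that it is self-contained. ping_all and the per-server ping callables are hypothetical; in Tahoe-LAFS the ping would presumably be some remote call made over foolscap, which is an assumption here.

```python
import asyncio
import time


async def ping_all(servers, timeout=5.0):
    """Ping every server concurrently and record whether it responded and
    its round-trip time. `servers` maps a server name to a zero-argument
    coroutine function standing in for a remote "ping" call (hypothetical)."""

    async def ping_one(name, ping):
        start = time.monotonic()
        try:
            # wait_for cancels the ping if it has not finished within `timeout`.
            await asyncio.wait_for(ping(), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            return name, {"responded": False, "rtt": None}
        return name, {"responded": True, "rtt": time.monotonic() - start}

    results = await asyncio.gather(*(ping_one(n, p) for n, p in servers.items()))
    return dict(results)


# Fake servers: one answers quickly, one hangs, one drops the connection.
async def fast():
    await asyncio.sleep(0.01)


async def hung():
    await asyncio.sleep(10)


async def refused():
    raise ConnectionError("connection lost")


status = asyncio.run(ping_all({"A": fast, "B": hung, "C": refused}, timeout=0.1))
for name, s in sorted(status.items()):
    print(name, s["responded"], s["rtt"])
```

Because the pings run concurrently, a hung server costs the whole operation at most one timeout rather than one timeout per server, and the result reflects what actually happened just now instead of a cached "connected" flag.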

 To close this ticket, make sure you have Brian's approval first, then add
 a "ping all servers" feature to the welcome page, then remove all uses of
 {{{notifyOnDisconnect()}}} from Tahoe-LAFS.

Ticket URL: <http://allmydata.org/trac/tahoe/ticket/816>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid
