[tahoe-dev] [tahoe-lafs] #1212: Repairing fails if less than 7 servers available

tahoe-lafs trac at tahoe-lafs.org
Thu Oct 14 04:40:38 UTC 2010


#1212: Repairing fails if less than 7 servers available
------------------------------+---------------------------------------------
     Reporter:  eurekafag     |       Owner:                            
         Type:  defect        |      Status:  reopened                  
     Priority:  major         |   Milestone:  1.8.1                     
    Component:  code-network  |     Version:  1.8.0                     
   Resolution:                |    Keywords:  reviewed regression repair
Launchpad Bug:                |  
------------------------------+---------------------------------------------

Comment (by zooko):

 I guess something that I haven't made up my mind about yet is how repair
 jobs (either the {{{tahoe repair}}} command on the CLI or clicking on the
 "check-and-repair" button in the WUI) should handle the case where the
 upload/repair fails, or partially fails, on some of the files.

 Should it proceed to completion, generate a report saying to what degree
 each attempt to repair a file succeeded, and exit with a "success" code
 (i.e. exit code 0 from {{{tahoe repair}}})? Or should it abort the attempt
 to repair that one file, and perhaps also abort any other file-repair
 attempts in the current deep-repair job?
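
 As a purely hypothetical sketch of the first option (proceed, report
 per-file results, and exit 0), a deep-repair loop could look something
 like the following. None of this is actual tahoe behavior; the
 {{{RepairError}}} type, the {{{repair_one()}}} helper, and the exit-code
 policy are invented for illustration.

{{{
#!python
import sys

class RepairError(Exception):
    """Stand-in for whatever error a single-file repair might raise."""

def repair_one(path):
    """Stand-in repairer: pretend files whose name contains 'bad' fail."""
    if "bad" in path:
        raise RepairError("not enough servers")

def deep_repair(paths):
    report = []
    for p in paths:
        try:
            repair_one(p)
            report.append((p, "healthy"))
        except RepairError as e:
            # First option: record the failure and keep going.
            report.append((p, "repair failed: %s" % e))
    for p, status in report:
        print("%s: %s" % (p, status))
    # Exit 0 because the deep-repair job itself ran to completion, even
    # though some files may still be unhealthy.  The other option would
    # abort (or exit non-zero) as soon as any single repair failed.
    return 0

if __name__ == "__main__":
    sys.exit(deep_repair(["good-file", "bad-file", "another-good-file"]))
}}}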

 For example, suppose you ask it to repair a single file with {{{K=3, H=7,
 N=10}}}, and it finds out that there are only two storage servers
 currently connected. One storage server has 3 shares and the other has 0.
 Then should it abort the upload immediately? Or should it upload a few
 shares (3?) to the second storage server, which currently has none, and
 then report to you that the file is still unhealthy?
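
 To make the arithmetic concrete: if we count health the
 servers-of-happiness way, as the size {{{|M|}}} of a maximum matching
 between servers and the shares they hold, then this scenario starts at
 {{{|M| = 1}}}, and the best any repair can do with only two connected
 servers is {{{|M| = 2}}}, still well short of {{{H = 7}}}. Here is a
 minimal Python sketch of that calculation (not Tahoe-LAFS code; the
 server names and share layout are made up):

{{{
#!python
def happiness(share_map):
    """Size of a maximum matching between servers and share numbers.

    share_map maps a server name to the set of share numbers it holds.
    """
    match = {}  # share number -> server it is currently matched to

    def try_assign(server, seen):
        for sh in share_map[server]:
            if sh in seen:
                continue
            seen.add(sh)
            # Take this share if it is free, or if its current holder can
            # be re-matched to some other share (an augmenting path).
            if sh not in match or try_assign(match[sh], seen):
                match[sh] = server
                return True
        return False

    return sum(1 for server in share_map if try_assign(server, set()))

# The scenario above: K=3, H=7, N=10, only two servers connected.
before = {"serverA": {0, 1, 2}, "serverB": set()}
after = {"serverA": {0, 1, 2}, "serverB": {3, 4, 5}}  # repair sent 3 shares
print("before: %d  after: %d" % (happiness(before), happiness(after)))
# prints "before: 1  after: 2" -- both far below H=7
}}}

 Whether going from {{{|M| = 1}}} to {{{|M| = 2}}} is worth the bandwidth
 is exactly the question that principles 2 and 3 below answer differently.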

 Here is one set of principles to answer this question (not sure if this is
 the best set):

 1. ''Idempotence'': if you run an upload-or-repair job, and it does some
 work (uploads some shares), and then you run it again when nothing has
 changed among the servers (no servers have joined or left, and none of
 them have acquired or lost shares), then the second run will not upload
 any shares.

 2. ''Forward progress'': if you run a repair job (not necessarily an
 upload job!), and it is possible for it to make {{{|M|}}} greater than it
 was before, then it will do so.

 If we adopt these two principles, then we give up on an alternative
 principle:

 3. ''Network efficiency'': if you run an upload or repair job, and it is
 impossible for it to make {{{|M| >= H}}}, then it does not use any bulk
 network bandwidth. (Also, if it looks possible at first, but one of the
 servers fails after the upload has started and it becomes impossible,
 then it aborts right then and does not use any ''more'' of your network
 bandwidth.)
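
 To spell out how differently principles 2 and 3 decide the two-server
 example above, here is a hypothetical sketch (the function names and the
 {{{achievable_m}}} input are invented; figuring out the best achievable
 {{{|M|}}} is assumed to happen elsewhere):

{{{
#!python
def repair_under_forward_progress(current_m, achievable_m):
    # Principle 2: spend bandwidth whenever |M| can be increased at all.
    return achievable_m > current_m

def repair_under_network_efficiency(current_m, achievable_m, h):
    # Principle 3: spend bulk bandwidth only if |M| >= H is reachable.
    return achievable_m >= h

# Two-server example: current |M| = 1, best achievable |M| = 2, H = 7.
print(repair_under_forward_progress(1, 2))       # True  -> upload anyway
print(repair_under_network_efficiency(1, 2, 7))  # False -> abort, save bandwidth
}}}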

 I think people (including me) intuitively wanted principle 3 for uploads,
 but now that we are thinking about repairs instead of uploads, we
 intuitively want principle 2.

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1212#comment:26>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage

