[tahoe-dev] several questions about tahoe backup

Brian Warner warner at lothar.com
Sat Mar 17 19:54:21 UTC 2012

On 3/1/12 11:50 AM, Dmitriy Kazimirov wrote:

> 98% of this is the result of tahoe backup runs, and deep-check --repair
> --add-lease takes more than 2 days to complete and has never completed
> successfully for one reason or another. Usually it crashes on network
> errors (I think related to the gateway's network connection) and
> starts from the beginning.

Yeah, I'm really starting to think that we need that RepairAgent that we
keep talking about: something inside the tahoe client node (or inside
the "Agent" that we discussed briefly at the last Summit) which can
manage a rate-limited asynchronous interruptible/restartable
check/repair/renew process. Doing it entirely from the command line is
too fragile: it basically implies that the whole process needs to
complete before 1: your CLI command gets killed, 2: the node gets
bounced, 3: the client's connections to the storage servers remain

The hard part about the RepairAgent has always been how to configure it.
But I'm starting to think that tahoe.cfg should just be able to list a
couple of aliases and a frequency (once/day, etc).
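
To make the "interruptible/restartable" idea concrete, here is a minimal
sketch of a checkpointed, rate-limited loop. It is not Tahoe code and does
not use any Tahoe API: run_checks, check_one, state_path and min_interval
are hypothetical names, and check_one stands in for whatever
check/repair/renew operation would be performed on one item.

```python
import json
import os
import time

def run_checks(items, check_one, state_path, min_interval=0.0, sleep=time.sleep):
    """Call check_one(item) for each item in order, resuming after the
    last item recorded in state_path. Returns how many items were
    processed in this run. All names here are illustrative assumptions."""
    done = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)["done"]
    for index in range(done, len(items)):
        check_one(items[index])  # may raise, e.g. on a network error
        # Write the checkpoint to a temp file and rename it, so a crash
        # never leaves a half-written checkpoint behind.
        tmp = state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"done": index + 1}, f)
        os.replace(tmp, state_path)
        sleep(min_interval)  # crude rate limit between items
    return len(items) - done
```

If check_one raises partway through, the next call with the same
state_path picks up at the item that failed instead of starting from the
beginning.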

> I tried to make specific aliases for each tahoe backup destination and
> parallelize deep-check but new questions arise:
> - if another gateway performs deep-check on a directory, and at the same
>   time tahoe backup writes a new backup - is it safe? Will the worst
>   possible result be 'Uncoordinated Write Error' but no data loss except
>   possibly the current backup session?

Backups use almost entirely immutable files and directories, and
immutable files are not vulnerable to collision problems. The only
mutable directory is the top-level one (which contains the timestamped
subdirectories and the "Latest" link). So the only potential danger is
that the "tahoe backup" is modifying that directory while the
"deep-check --repair" is repairing it.

It's hard to say what the chances are of bad things happening here. A
couple of thoughts:

 - "tahoe backup" modifies the top-level directory at the very end of
   the process, while "deep-check --repair" visits the top directory at
   the very beginning of the process. If you start both at the same
   time, they probably won't collide.
 - if the directory needed repairing, then the "tahoe backup"
   modify-directory operation will effectively repair it at the same
   time, perhaps reducing the chances that "deep-check --repair" will try
   to repair it later.
 - The chances of UCWE (Uncoordinated Write Error) resulting in lost data
   depend upon the encoding parameters and how many simultaneous writers
   there are. With 3-of-10, assuming all shares are available, there'd
   have to be 5 simultaneous versions of the file before none of them are
   recoverable (two shares of version A, two of B, two of C, two of D,
   two of E, so no version has the three shares it needs). Which
   theoretically means that one tahoe-backup writer, one repairer, and
   one previously-existing version should always result in at least one
   version being recoverable. But, this is highly dependent upon what
   sort of damage the repairer was trying to fix in the first place.
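
The 3-of-10 arithmetic above is a pigeonhole argument and can be checked
with a short sketch. The function names are made up for illustration and
are not part of Tahoe; the sketch assumes all n shares are present and
each share belongs to exactly one version.

```python
def min_versions_for_total_loss(k, n):
    """Smallest number of coexisting versions at which it becomes
    possible that no version holds k of the n shares.

    A version is unrecoverable when it holds at most k-1 shares, so v
    versions can absorb at most v*(k-1) shares without any of them
    being recoverable. We need the smallest v with v*(k-1) >= n.
    """
    if k < 2:
        raise ValueError("with k == 1 any single share recovers its version")
    return -(-n // (k - 1))  # integer ceiling of n / (k-1)

def some_version_guaranteed(k, n, versions):
    """True if, with this many versions sharing n shares, at least one
    version must hold k or more shares."""
    return versions * (k - 1) < n

print(min_versions_for_total_loss(3, 10))
print(some_version_guaranteed(3, 10, 3))
```

For 3-of-10 the first call gives 5, matching the count above, and three
versions (old, backup writer, repairer) stay on the safe side of the bound.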

> - if several deep-checks are being performed on a directory, from
>   different gateways (due for example to a script error) - is it safe? (I
>   know it does not make sense to do so but what will be the worst possible
>   result?)

(incidentally deep-check without --repair is always safe: it doesn't
modify the shares at all)

Multiple deep-check --repair operations on the same directory run the
same UCWE risk as multiple writers trying to modify a directory at the
same time: the danger is that each will successfully replace shares on
different servers at the same time, resulting in some old "A" shares,
some new "B" shares, and some new (different) "C" shares. Having a
variety of share versions reduces your recoverability margins.

We haven't ever tried to seriously model or test this. It would be
interesting to create a directory, set up multiple clients, have them
all repeatedly hammer away at it (making modifications) and measure how
many UCWEs they get, and when/if the file becomes unrecoverable. It'd
probably be a good idea to instrument the storage servers to send
details of how the shares are changing to a common logfile, so we could
reconstruct the process later.
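
Short of a real test against storage servers, a toy simulation gives a
feel for the experiment described above. This is only a sketch under
loud assumptions: every server starts with one share of the old version
(numbered 0), each writer tries to replace the share on every server,
the attempts are interleaved at random, and a write only succeeds if the
server still holds the old version. race_once and its parameters are
hypothetical names; real share placement and repair behaviour are more
complicated than this.

```python
import random
from collections import Counter

def race_once(k, n, writers, rng):
    """Simulate one collision between `writers` simultaneous writers on
    an n-share file needing k shares. Returns (share counts per version,
    number of writers that saw a failed write, recoverable versions)."""
    # One attempt per (writer, server) pair; shuffling models an
    # arbitrary interleaving of the writers' messages.
    attempts = [(w, s) for w in range(1, writers + 1) for s in range(n)]
    rng.shuffle(attempts)
    shares = [0] * n   # version held by each server; 0 is the old one
    ucwe = set()       # writers that lost the race on at least one server
    for w, s in attempts:
        if shares[s] == 0:
            shares[s] = w   # server still had the old version: write lands
        else:
            ucwe.add(w)     # someone else got there first
    counts = Counter(shares)
    recoverable = sorted(v for v, c in counts.items() if c >= k)
    return counts, len(ucwe), recoverable

rng = random.Random(2012)   # fixed seed so runs are repeatable
trials = 1000
lost = sum(1 for _ in range(trials) if not race_once(3, 10, 5, rng)[2])
print("unrecoverable outcomes with 5 writers:", lost, "of", trials)
```

In this model two or three writers can never make a 3-of-10 file
unrecoverable, since some version always ends up with at least three
shares; only with five or more writers can the shares be spread too thin.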

