[tahoe-dev] Thinking about building a P2P backup system

Tue Jan 13 07:54:30 UTC 2009

On Fri, 9 Jan 2009 23:00:46 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:

> Is there infrastructure in place for append-only files?  That would
> work much better for incremental backups than storing the forward
> deltas as separate files.  Actually, the LDMF design you're talking
> about would make handling this in the backup server unnecessary.
> What's the status of that?  Does it makes sense to get that working
> first?

Nope, we haven't actually built any LDMF code yet, nor anything for
server-side append-only. Append-only makes the distributed-coordination
problem slightly easier (you send records to the storage servers, who
add them to a set for each file, and downloaders can decide how to deal
with any conflicts).

I think LDMF is a good thing to work on for better versioning, and for (of
course) handling large slow-moving files efficiently. But I think we could
get a fine backup system written without it, using only the SDMF files we
have already, so I'd try to write the backup stuff first, then LDMF. On the
other hand, if you're looking for a fun project... :).

> Download the last full version plus all of the forward deltas and
> then as you go along patching the full version to bring it up to
> date, generate a trail of reverse deltas.  Then upload the new full
> copy as the head of the revlog and the series of reverse deltas as
> the older versions.  New versions become forward deltas.

Hm, so instead of:

 FdddddFdddd

  (where 'F' is a full version and 'd' is a forward-delta)

you'd have:

 rrrFdddddFddd

  (where 'r' is a reverse delta)

and could retrieve some number of revisions that are older than the last full
version. If you want to trade off space against speed, you could prune that
down to:

 rrrrrrFddddd

Interesting.

> GC of versions older than the last full copy is a simple matter of
> removing them from the revlog.  Of course, that breaks the "append
> only" semantics you mentioned.

Yeah, well, I think the approach will be more like a pool of records (deltas
or full versions), and the server operations will be "add record" or "remove
record".

Oh, incidentally, part of the LDMF plan was to have a full DAG of revisions,
not necessarily a single line. That also helps with the coordination problem.
Suppose the LDMF file represents a directory, and two clients add a child at
the same time: with a DAG, some other client, perhaps one of the two
contenders, can merge the branches by producing a directory with both
children. With a single line, there are more situations that result in data
loss or require locking. The DAG preserves more information for the
application to do later conflict resolution.

I'm not sure if having a generalized DAG would interfere with using these
reverse deltas, but I suspect it would.

> Makes sense.  I don't know enough about erasure codes to know if 
> they "randomize" the data in the pieces to make patching
> ineffective.  It doesn't surprise me that this is the case, though.

I think they're linear, but yeah, you tend to get shares that are the XOR of
up to half the other shares, which I think would make patching difficult.

cheers,
 -Brian