[tahoe-dev] Thinking about building a P2P backup system

Ian Levesque ian at ephemeronindustries.com
Fri Jan 9 15:40:42 UTC 2009


This really does seem to be reinventing rdiff-backup.  Are you sure
you can't save yourself a lot of effort by modifying that?

-Ian

On Jan 9, 2009, at 1:49 AM, Shawn Willden wrote:

> On Thursday 08 January 2009 09:17:40 am zooko wrote:
>> Have you seen this thread?  It might be a good project for you, as it
>> is self-contained, requires minimal changes to the tahoe core itself,
>> and is closely related to your idea about good backup:
>>
>> http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
>
> I think it's exactly what I want to do.  Not exactly the way I want
> to do it, but close.
>
> Here's an overview of how I think I'll approach the problem.  All
> suggestions welcome.
>
> First, I'll run the backup as two separate processes: a scanner,
> which determines what to back up, and an uploader, which does the
> job of backing things up.
>
> There are several reasons for separating the two tasks, but the
> biggest ones are (a) I expect there will ultimately be several
> different scanners, and (b) because a large backup will take a long
> time (especially an initial backup), the upload process needs to be
> resumable anyway, so there has to be some way of tracking where it
> is in the process.  Having an upload queue generated by the scanner
> and written to disk is a good way to do that.  An interrupted scan
> can simply be restarted from the beginning, since scanning is
> relatively cheap.
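>
> A minimal sketch of the split, assuming a plain one-path-per-line
> queue file on disk (all names here are hypothetical):
>
>     import os
>
>     QUEUE = "upload.queue"
>     DONE = "upload.queue.done"
>
>     def scan(root, queue_path=QUEUE):
>         # Scanner: walk the filesystem and write one path per line.
>         # An interrupted scan is just restarted from scratch.
>         with open(queue_path, "w") as q:
>             for dirpath, dirnames, filenames in os.walk(root):
>                 for name in filenames:
>                     q.write(os.path.join(dirpath, name) + "\n")
>
>     def upload_pending(queue_path=QUEUE, done_path=DONE):
>         # Uploader: log each completed path so that a restart can
>         # skip everything already uploaded and resume mid-queue.
>         done = set()
>         if os.path.exists(done_path):
>             done = set(open(done_path).read().splitlines())
>         with open(done_path, "a") as log:
>             for path in open(queue_path).read().splitlines():
>                 if path in done:
>                     continue
>                 upload_one(path)  # hypothetical; does the real work
>                 log.write(path + "\n")
>                 log.flush()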
>
> Change detection will of course be based on size, mtime and ctime.
> ctime needs to be examined to catch metadata changes.  The scanner
> will check these values against a backupdb (more on that below).
> If a full "hash check" is desired, the scanner can simply queue
> every file for the uploader.
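>
> As a sketch, assuming the backupdb is an sqlite connection with a
> table keyed by path (the schema here is hypothetical):
>
>     import os
>
>     def changed(db, path):
>         # Compare size/mtime/ctime against what the backupdb
>         # recorded at the last successful backup.
>         st = os.stat(path)
>         row = db.execute(
>             "SELECT size, mtime, ctime FROM files WHERE path=?",
>             (path,)).fetchone()
>         if row is None:
>             return True  # never backed up
>         size, mtime, ctime = row
>         return (st.st_size != size or st.st_mtime != mtime
>                 or st.st_ctime != ctime)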
>
> The uploader will read from the queue and compute an rdiff
> signature for each file.  If the file is in the backupdb, the
> uploader will check the computed rdiff signature against the stored
> signature.  If they differ, it will use the stored signature to
> compute an rdiff delta and upload that.  It will also store the
> current metadata in the backupdb.
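>
> For example, shelling out to the rdiff tool that ships with
> librsync (a sketch; the path arguments are whatever the uploader
> chooses):
>
>     import subprocess
>
>     def make_signature(path, sig_path):
>         # "rdiff signature BASIS SIG" writes a signature of path.
>         subprocess.check_call(
>             ["rdiff", "signature", path, sig_path])
>
>     def make_delta(old_sig_path, path, delta_path):
>         # "rdiff delta SIG NEWFILE DELTA" writes the delta that
>         # turns the old version (described by old_sig_path) into
>         # the current contents of path.
>         subprocess.check_call(
>             ["rdiff", "delta", old_sig_path, path, delta_path])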
>
> To avoid missing changes due to limited mtime granularity, the
> stored metadata will have mtime set to min(mtime, start-1), where
> "start" is the time at which the uploader started reading the file
> to compute the rdiff signature.
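>
> In code, roughly:
>
>     import os
>
>     def clamped_mtime(path, start):
>         # "start" is time.time() taken just before the uploader
>         # began reading the file for its rdiff signature.
>         return min(os.stat(path).st_mtime, start - 1)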
>
> There is another possible issue, a serious one.  If the file
> changes between the calculation of the rdiff signature and the
> generation of the delta, the chain of forward deltas will be
> broken.  To avoid that, the uploader will check the mtime after
> generating the delta.  If it is greater than or equal to "start",
> the uploader will copy the file to a temporary location, calculate
> a new rdiff signature while copying, compute the delta from the
> temporary copy, and use those signature and delta values.  The copy
> is only really necessary if the file is going to continue changing,
> but since it changed once while the uploader was working, it's
> reasonable to assume that it might be a log file or something else
> that changes frequently.
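>
> Roughly, reusing the helpers sketched above (and simplifying the
> copy-while-signing step into a copy followed by a re-signature):
>
>     import os, shutil, tempfile, time
>
>     def backup_file(path, old_sig, new_sig, delta):
>         start = time.time()
>         make_signature(path, new_sig)
>         make_delta(old_sig, path, delta)
>         if os.stat(path).st_mtime >= start:
>             # The file changed underneath us: snapshot it to a
>             # temporary copy and redo both the signature and the
>             # delta from that stable copy.
>             fd, tmp = tempfile.mkstemp()
>             os.close(fd)
>             shutil.copy2(path, tmp)
>             make_signature(tmp, new_sig)
>             make_delta(old_sig, tmp, delta)
>             os.remove(tmp)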
>
> When the uploader has exhausted the queue, it then uploads the
> backupdb.  In order to make that efficient, for incremental backups
> it uploads a delta of the backupdb.
>
> I'm still thinking about how to structure the directories into
> which all this stuff gets uploaded.  I think it'll probably be a
> single tree structure that mirrors the backed-up filesystem
> structure, but each file will become a subdirectory, into which the
> original plus deltas will be placed.  The backupdb and its deltas
> will go just above the tree.  Oh, and all of the various uploads
> mentioned are immutable.
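>
> For instance (all names hypothetical), a backed-up
> /home/shawn/notes.txt might end up stored as:
>
>     backup-root/
>         backupdb.orig
>         backupdb.delta.1
>         home/
>             shawn/
>                 notes.txt/
>                     original
>                     delta.1
>                     delta.2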
>
> The downside of using forward deltas is that you have to have all
> of them, in a complete unbroken chain, in order to get from the
> original to any other version, including the most recent.  This
> means that the longer the chain of deltas, the less reliable the
> most recent version is.  So we'll also need a maintenance process
> with the ability to retrieve the original and the deltas,
> reconstruct all of the versions, generate the most recent version,
> and then generate reverse deltas, probably discarding some of the
> backup intervals along the way.
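>
> Reconstruction would just walk the chain with "rdiff patch",
> something like this (a sketch, using the layout above):
>
>     import os, shutil, subprocess
>
>     def reconstruct_latest(file_dir, out_path):
>         # Apply delta.1, delta.2, ... in order to the original.
>         shutil.copy(os.path.join(file_dir, "original"), out_path)
>         n = 1
>         while True:
>             delta = os.path.join(file_dir, "delta.%d" % n)
>             if not os.path.exists(delta):
>                 return out_path
>             # "rdiff patch BASIS DELTA NEWFILE" applies one delta.
>             tmp = out_path + ".tmp"
>             subprocess.check_call(
>                 ["rdiff", "patch", out_path, delta, tmp])
>             os.rename(tmp, out_path)
>             n += 1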
>
> There are lots of details to fill in, but that's the outline.
> Comments?
>
> 	Shawn.
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
