[tahoe-dev] BackupDB proposal

Brian Warner warner-tahoe at allmydata.com
Thu May 29 19:12:33 UTC 2008


> Rehashing the entire filesystem every time you're trying to run an
> incremental backup is obnoxious. It will take a large amount of disk IO and
> CPU time to do even a small backup.

For reference, "time find ~ -type f -exec sha256sum {} \;" on my local
workstation (about 44GB used in 545k files) took 45 minutes, mostly blocked
on disk IO (total CPU time was 16 minutes).

In contrast, "time find ~ -type f -ls" (with a hot cache) took 2.6 minutes,
with an even larger share of that spent blocked on disk IO (total CPU time
was 25 seconds). With a backupdb as defined earlier, the full backup should
take only a few seconds more than this.
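
To make the comparison concrete, here is a rough Python sketch of the kind of
check I have in mind. It is not the backupdb design itself: the table layout
and the names open_db, sha256_of and scan are made up for illustration, and I
am assuming a file counts as "unchanged" when its size and mtime match what
was recorded last time. Only files that fail that test get read and hashed.

```python
import hashlib
import os
import sqlite3
import stat


def open_db(path=":memory:"):
    # Hypothetical schema: one row per file, keyed by path.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, sha256 TEXT)"
    )
    return db


def sha256_of(path):
    # Read in 1 MiB chunks so large files are not loaded into memory at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def scan(db, root):
    """Walk root and return the paths that had to be rehashed."""
    rehashed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other non-regular files
            row = db.execute(
                "SELECT size, mtime_ns FROM files WHERE path = ?", (path,)
            ).fetchone()
            if row == (st.st_size, st.st_mtime_ns):
                continue  # metadata matches: assume the contents are unchanged
            db.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (path, st.st_size, st.st_mtime_ns, sha256_of(path)),
            )
            rehashed.append(path)
    db.commit()
    return rehashed
```

On a second run over an untouched tree, scan() does one lstat and one database
lookup per file and never opens any of them, which is why the cost should
look like the "find -ls" number rather than the sha256sum number.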

> Putting the hashes alongside each file is also obnoxious for a couple of
> reasons, but most importantly because a backup is nominally a read
> operation. Scribbling all over the filesystem you're supposed to be backing
> up is a bad idea.

It would be really cool if the filesystem were to provide us with a reliable
indicator of whether the file had changed: a version number, or a strong hash
of the contents. Maybe HFS+ or ZFS or reiser4 or one of the other fancy new
ones could provide this feature.

I haven't measured it, but I believe that Apple's Time Machine does a full
backup of my home system (about 50GB) in maybe 2 or 3 minutes when nothing
has changed, so I must assume that they have a fast way to decide that a file
is probably unchanged without actually reading the contents. It would be an
interesting experiment to modify a file without changing its size, then set
the mtime back to its original value, then see if Time Machine notices.
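
The file-side half of that experiment is easy to script. The following is
only a sketch in Python, with made-up helper names (digest and
same_size_edit_with_restored_mtime); it overwrites a file with new bytes of
the same length and then puts the old timestamps back, so that any checker
looking only at size and mtime sees nothing different. Whether Time Machine
is fooled is exactly the part this does not answer.

```python
import hashlib
import os


def digest(path):
    # Hash the whole file; fine for the small test files used here.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def same_size_edit_with_restored_mtime(path, new_bytes):
    """Overwrite path in place with same-length data, then restore its times.

    Returns the os.stat() results from before and after the edit.
    """
    before = os.stat(path)
    if len(new_bytes) != before.st_size:
        raise ValueError("replacement must be the same length as the file")
    with open(path, "r+b") as f:
        f.write(new_bytes)
    # Put atime and mtime back to their original nanosecond values.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return before, os.stat(path)
```

After calling it, the before and after results agree on st_size and
st_mtime_ns while digest() reports different contents. Note that on POSIX
systems st_ctime still moves forward, since it cannot be set through
os.utime(), so a tool that also watches ctime would have a chance of
noticing.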

> So, you could create a parallel tree with the file hashes, but if you're
> going to do that, then a database is faster and easier. And, FWIW, common
> practice in backup tools.

Yeah, relying upon timestamp+size seems pretty common: that's what rsync does
unless you ask it not to (e.g. with --checksum). I've gotten in trouble with
it before (usually from unit tests that were changing files faster than any
human would), but I believe most people would accept the tradeoff.

cheers,
 -Brian


