[tahoe-dev] tahoe backup re-uploads old files
gdt at ir.bbn.com
Thu Mar 1 19:27:45 UTC 2012
Brian Warner <warner at lothar.com> writes:
> Yeah, that's a fair argument. I built "tahoe backup" because it seemed
> the best way to take advantage of tahoe's unique features. The
> orthogonal way to handle backups, as implemented in a zillion existing
> programs, generally expects a POSIX-like backend filesystem. Tahoe is
> both more and less than that:
> * it has immutable files and directories, which can safely be shared
> between subsequent backups
> * modifying files is expensive, and new files should be written
True, but I wonder whether that means a tahoe-specific backup program is
needed, or just one that uses a mostly-POSIX filesystem in a careful
way, so that it's reasonably efficient for a class of filesystems.
> * tahoe files need to be checked/repaired/renewed every once in a while
That's what deep-check is for, and I don't think it needs to be part of
the backup program.
> Using "cp -r" into a FUSE-mounted Tahoe filesystem would miss all of
> this: each pass would try to re-copy pre-existing files (unless you
> build a backupdb to avoid it), each pass would duplicate existing
That's true, but it's also an argument for why 'cp -r' to an external HDD
is not a good backup scheme.
So I do think a backup program is needed that is aware of both the
write-files-once-and-don't-change-them notion and the
sharing-of-existing-files notion.
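
To make the "don't re-upload unchanged files" idea concrete, here is a
minimal sketch. It is not the code in tahoe_backup.py; the names
backup_file, BackupDB and upload are hypothetical, and I'm assuming that
an unchanged (size, mtime) pair is a good enough signal that a file's
contents have not changed.

```python
import os
from typing import Callable, Dict, Tuple

# Hypothetical in-memory stand-in for a backup database:
# local path -> ((size, mtime_ns), cap returned by the previous upload).
BackupDB = Dict[str, Tuple[Tuple[int, int], str]]


def backup_file(path: str, db: BackupDB,
                upload: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (cap, uploaded).

    The file is uploaded only if its size or mtime differs from what was
    recorded on the previous run; otherwise the recorded cap is reused,
    so the old immutable file is shared between backups.
    """
    st = os.stat(path)
    fingerprint = (st.st_size, st.st_mtime_ns)

    entry = db.get(path)
    if entry is not None and entry[0] == fingerprint:
        # Unchanged since last run: reuse the existing immutable file.
        return entry[1], False

    # New or modified: write a new file rather than changing the old one.
    cap = upload(path)  # 'upload' is an assumed callable, e.g. a webapi PUT
    db[path] = (fingerprint, cap)
    return cap, True
```

Nothing here is tahoe-specific beyond "upload returns a stable handle",
which is roughly the careful-use-of-a-filesystem approach I have in mind.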
> and the FUSE layer would add a lot of overhead. (I've never
> really been content with FUSE-over-Tahoe, it basically works, but the
> impedance mismatch is just too great to make it a happy experience).
I'm not convinced by the FUSE overhead claim, but I think part of the
concern is that we don't have a first-class FUSE implementation -
playing with py-filesystem is on my todo list. I've seen people run
glusterfs (on Linux and NetBSD) and complain about TCP performance
because they only got 40 MB/s instead of 75 MB/s (through FUSE), and
then get 75 MB/s once the driver bug was fixed. Tahoe's speed seems slow
enough that it's hard to believe that FUSE would slow it down much.
> Of course, it's also there because of Tahoe's historical origins in a
> backup-centric company.
A fair point.
> FWIW, "tahoe backup" is basically a standalone program that speaks the
> tahoe webapi to achieve backup tasks, that just happens to use bin/tahoe
> as an entry point, and is distributed along with the rest of tahoe. With
> some architectural changes, it could be a plugin (sort of like how "git
> foo" vectors off to a program named "git-foo", so adding shallow plugins
> is as easy as dropping a git-foo executable into your $PATH). If you
> were to write an independent backup program that took advantage of
> tahoe's unique features (instead of targeting a POSIX filesystem), it
> would probably look a lot like src/allmydata/scripts/tahoe_backup.py .
I didn't know that, but the command-line integration wasn't the root of
my complaint - it's the use of a filesystem-specific interface when I
haven't convinced myself that it's really necessary.
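
For reference, the "git foo" -> "git-foo" style of dispatch mentioned in
the quoted text can be sketched as below. This is only an illustration
of the idea, not how bin/tahoe works today; the "tahoe-" prefix and the
function names find_plugin and plugin_argv are assumptions.

```python
import shutil
from typing import List, Optional, Sequence


def find_plugin(subcommand: str, prefix: str = "tahoe-",
                path: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable named prefix+subcommand.

    'path' is a search path string in the same format as $PATH; when it
    is None, shutil.which falls back to the process's own $PATH.
    Returns None if no such executable is found.
    """
    return shutil.which(prefix + subcommand, path=path)


def plugin_argv(subcommand: str, args: Sequence[str],
                path: Optional[str] = None) -> Optional[List[str]]:
    """Build the argv for running a plugin, or None if it does not exist.

    A caller would fall back to its built-in commands on None, and
    otherwise hand the returned list to subprocess.call or os.execv.
    """
    exe = find_plugin(subcommand, path=path)
    if exe is None:
        return None
    return [exe] + list(args)
```

With something like this, dropping an executable named tahoe-foo into
$PATH would be enough to make "tahoe foo" work.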
> There are some other, similar tools that I'd like to have: "tahoe
> mirror" to do one-way syncing of local-fs to tahoe-fs, "tahoe sync" to
> do a bidirectional sync (ala Dropbox). And then I'd like "tahoe backup"
> to be more integrated into the tahoe daemon (or into an "agent", as we
> discussed at the last Summit), to be run periodically and safely without
> me having to set up a cronjob for it. And "tahoe sync" could be driven
> by inotify/fseventsd-style events. But, I'd expect to need to make
> similar arguments about why such features should go into Tahoe itself,
> rather than being implemented in standalone tools, before putting
> serious time into writing them.
Interesting points, and someday we will both have enough spare time to
discuss them....