[tahoe-dev] Thinking about building a P2P backup system

zooko zooko at zooko.com
Wed Jan 7 22:09:21 UTC 2009


Dear Shawn Willden:

First of all: welcome!  It seems like we have a lot of goals in  
common.  Tahoe is currently blessed with an abundance of cool ideas  
and opportunities, but not as much of an abundance of active hackers  
-- we currently have about four or five semi-active contributors I  
would say.  So, if you're planning to contribute patches, bug  
reports, documentation, etc., then I'm delighted!

Second of all: I'm awfully busy right now racing to finish a  
Repairer, so I'll try to make this note brief.  My goal is to let you  
know which of your feature ideas are things that I also want and  
which are things that I don't particularly want.

On Jan 7, 2009, at 13:37 PM, Shawn Willden wrote:
> First, I want to be able to use the "backup" system to also  
> accomplish a sort of targeted filesharing for a subset of the  
> backed-up files.  The exemplar use case is my digital photo album.   
> I'd like to back it up to others' computers to keep it safe, but  
> I'd also like some specific people (e.g. my mom) to have high-speed  
> access to it.  In the Tahoe model I could push the image directory  
> onto the grid and give her read-only access to it, but given that  
> for a relatively small network of home machines (most of whom have  
> sucky upstream bandwidth), that would not achieve the "high-speed"  
> goal.

Have you tried it?  It might be just fine for sharing photos.  I use  
Tahoe to share photos, but I use the public test grid instead of a  
private grid, and so I'm using many servers located in a co-lo plus a  
handful of random servers operated by Tahoe hackers or curious  
users.  It seems to work fine.

> Further, for this application, the RS coding, encryption, hash  
> trees, etc. are all pretty unnecessary (probably -- more below),  
> and avoiding them allows easy use of rsync or similar protocols to  
> efficiently update the backup when changes occur.  Well, rsync  
> doesn't help with efficient updating of photos, but with many other  
> file types it would.

I'm reluctant to agree to support the option of not doing the erasure  
coding, encryption, hash trees, etc.  Several people have suggested
things like this (which makes me worry that I am being stubborn), but  
nobody so far has actually demonstrated that these things cause too  
much of a performance problem or other problem.

I guess one of the main goals of Tahoe in my mind is to offer a fully
secure, erasure-coded back-end without requiring front-end users
(human or computer) to think about that stuff.

As far as I know, we are doing adequately well on that goal.  A few  
times people have asked to have the option to turn off the  
encryption, and in each case I asked them to please measure the  
performance and tell me if the encryption is causing a performance  
problem or another kind of usability problem.  None of them have  
written back.  I assume that this is because they got distracted and  
never got around to measuring.  ;-)  Maybe they decided to use a  
different storage tool that didn't come with that weird "security"  
stuff.

So, as to the idea of offering Tahoe without erasure coding or  
encryption: I don't particularly want that, unless someone can show  
me how much better it would be.

> So perhaps closer integration would make sense to allow the client  
> to automatically determine how many shares to push onto the grid in  
> addition to the targeted sharing (does Tahoe easily accommodate
> per-file values for k and n?  Or does it actually make sense to alter
> them to account for the existence of full backups?  I haven't  
> thought this through).

It is pretty easy to set the values of k and n on a per file basis.   
Also, as to the idea of "targeted sharing", there is something like  
that which I *do* want:

I want Tahoe to offer the user (human or computer) more control and  
more knowledge about which shares go to which storage server.

For example, you might rate some storage servers as highly available  
and others as rarely available but nonetheless having long-term  
reliability.  In that case, you might want to always store K shares  
on the highly-available servers and then store N-K shares on the  
rarely available but highly reliable servers.  As another example,  
you might want all files in your "photos/family" directory to have at  
least a certain number of shares stored on your mom's computer.

What I want, in general, is for Tahoe to expose this decision to the  
user (human or computer) so that such policies can be determined by  
the user without requiring changes to the Tahoe source code.
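
The first example above (K shares on highly-available servers, N-K on
the others) could be expressed as a small user-supplied policy.  This
is only a sketch under stated assumptions: the Server class, the
highly_available rating, and place_shares() are hypothetical names for
illustration, not part of Tahoe's code or API, and the round-robin
choice within each pool is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical record; the rating is supplied by the user, not by Tahoe.
    name: str
    highly_available: bool


def place_shares(servers, k, n):
    """Map share numbers 0..n-1 to server names.

    Shares 0..k-1 go to highly available servers; shares k..n-1 go to
    the remaining servers. Each pool is filled round-robin.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    fast = [s for s in servers if s.highly_available]
    slow = [s for s in servers if not s.highly_available]
    if not fast or (n > k and not slow):
        raise ValueError("not enough servers of each kind")

    placement = {}
    for share in range(n):
        if share < k:
            pool, index = fast, share
        else:
            pool, index = slow, share - k
        placement[share] = pool[index % len(pool)].name
    return placement


if __name__ == "__main__":
    servers = [
        Server("colo1", True),
        Server("colo2", True),
        Server("mom", False),
    ]
    for share, name in place_shares(servers, k=3, n=10).items():
        print(share, name)
```

A policy like the "photos/family" example would be another function of
the same shape, taking the directory into account when choosing the
pool.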

> Next, I want my backup system to be as close to configurationless  
> in the default installation as possible.

We definitely want that!

> then the second and later clients attempting to store a given file  
> will find it already stored.  That's not quite as good as not  
> storing it at all, but it's very close.

Yes, that's what it currently does (if you choose to share your "added
convergence secret" with all clients on the backup network).
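
The reason sharing the secret matters can be illustrated with a toy
key derivation.  This is not Tahoe's actual derivation: the function
name, the use of SHA-256, the length prefix, and the 16-byte key size
are all assumptions made for the sketch.  The point is only that the
key depends on both the secret and the file contents, so two clients
converge on the same stored file only if they hold the same secret.

```python
import hashlib


def convergent_key(convergence_secret: bytes, plaintext: bytes) -> bytes:
    """Derive an encryption key from a shared secret and the file contents.

    Illustrative only; the hash, framing, and key length are assumptions.
    """
    h = hashlib.sha256()
    # Length-prefix the secret so (secret, plaintext) pairs cannot collide
    # by shifting bytes between the two fields.
    h.update(len(convergence_secret).to_bytes(8, "big"))
    h.update(convergence_secret)
    h.update(plaintext)
    return h.digest()[:16]


if __name__ == "__main__":
    photo = b"same photo bytes"
    shared = convergent_key(b"family-secret", photo)
    # Same secret and same contents give the same key, so the file dedups.
    print(shared == convergent_key(b"family-secret", photo))
    # A client with a different secret derives a different key.
    print(shared == convergent_key(b"someone-else", photo))
```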

> To improve this, storage servers could index their local files and  
> note when a request to store a share for a file they possess arrives.

Interesting, but complicated.  I would like it if it worked, but I
don't know if I want to take on the burden of supporting this sort of
thing.

By the way, the GNUnet project offers that feature, so you should  
check them out.

> Next, I want incremental backups and versioning, and I want them to  
> be done bandwidth-efficiently.

Have you seen the duplicity plugin that Francois Deppierraz posted?   
Maybe that does exactly what you want.  :-)

> Finally, I want ultimately to be able to do a full-system restore.

Coool.  :-)

> Comments?  Would those sort of features be welcome additions to  
> your work?  Would you prefer that I just go away and do my own thing?

I would prefer that you use Tahoe and contribute patches, and if it
turns out that there is some behavior that you really want and that
seems too troublesome to me to risk including in my codebase, then
I would prefer that you copy the Tahoe darcs repository and develop
your own branch.

Regards,

Zooko


