[tahoe-dev] Keeping local file system and Tahoe store in sync

Wed Feb 4 02:11:33 UTC 2009

On Tuesday 03 February 2009 06:40:13 pm Brian Warner wrote:
> It's not fast, no.. in my experiments, hashing the whole disk is at least
> several hours, and sometimes most of the day. But I think we're both
> planning to use a cheap path+timestamp+size(+inode?) lookup table and give
> the user an option of skipping the hash when the timestamps are still the
> same.

Yes, except that I'd frame it the other way around: by default, if the
metadata matches, I don't hash. The user has the option of forcing a hash
even when the timestamps are unchanged.
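
To make that default concrete, here is a minimal sketch in Python. It is
only an illustration, not the actual tool: the names file_signature and
needs_hash, the in-memory dict used as the lookup table, and the choice of
(mtime, size, inode) as the signature are all assumptions.

```python
import os

def file_signature(path):
    """Return the cheap metadata tuple (mtime, size, inode) for path."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size, st.st_ino)

def needs_hash(path, table, force=False):
    """Decide whether path has to be read and hashed again.

    table is a hypothetical dict mapping path -> signature recorded by the
    previous scan. force models the user option to hash regardless.
    """
    if force:
        return True
    # Unknown path or any metadata difference means we must hash.
    return table.get(path) != file_signature(path)
```

A scan would call needs_hash() for each file, hash only when it returns
True, and then store file_signature(path) back into the table.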

If I could see a way to cut the hashing time further I would, but at least
on my machine, which I think is fairly typical, the scanning and hashing
are I/O bound, and there's obviously no way to avoid reading the data from
disk.

> So, given a file on disk, you have to do almost the entire Tahoe upload
> process to find out what the eventual Tahoe readcap is going to be. This
> sounds like it's at odds with your plan to upload the "backuplog" before
> you finish uploading some of the actual data files. I'm not sure how to
> rectify this.

Hmmm.  I knew I should have read that code... it's been on my to-do list for a 
while.

Yes, that does mess up the plan to upload the backuplog before the data.  I 
could still do it at the expense of increasing the size of the log, by 
leaving the read caps out of the log entries and appending a table that maps 
hashes to read caps, but that's unpleasant.
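
Roughly, that layout would look like the sketch below. This is a guess at
the shape only: the JSON encoding, the function names, and the use of plain
SHA-256 as the content hash are assumptions for illustration, and the read
cap strings are placeholders rather than real Tahoe caps.

```python
import hashlib
import json

def content_hash(data):
    # Assumption: plain SHA-256 stands in for whatever hash the tool uses.
    return hashlib.sha256(data).hexdigest()

def build_log(files):
    """files: dict of path -> bytes. Entries carry a hash but no read cap."""
    return [{"path": p, "hash": content_hash(d)}
            for p, d in sorted(files.items())]

def append_cap_table(entries, caps):
    """caps: dict of hash -> read cap, filled in as uploads complete."""
    return json.dumps({"entries": entries, "caps": caps}, sort_keys=True)

def resolve(log_text, path):
    """Look up the read cap for path, or None if not uploaded yet."""
    log = json.loads(log_text)
    for entry in log["entries"]:
        if entry["path"] == path:
            return log["caps"].get(entry["hash"])
    return None
```

Files with identical content share one table row, but every lookup now
takes two steps, and the log carries the extra table, which is the size
cost I mentioned.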

I suppose generating the full read cap early and doing the upload later could 
still be a win for users whose machines have slow upstream connections.

Another option would be to put dircaps in the backuplog, but that would 
require all those key pair generations, at least the first time.

This requires some thought.

Thanks for pointing that out though.  I'm pretty sure that's the only 
potentially-invalid assumption I'm making.

	Shawn.