[tahoe-dev] Potential use for personal backup

Zooko Wilcox-O'Hearn zooko at zooko.com
Thu Jun 14 17:15:41 UTC 2012


Thanks for reporting about these backup measurements, Saint Germain!

On Mon, Jun 11, 2012 at 9:06 PM, Saint Germain <saintger at gmail.com> wrote:
>
> As a basis for comparison, here are the results for bup:
> First backup: 12693 MB in 36 min

Hm, does bup do compression? If not, why wasn't this 24 GB?

> Second backup (after modifications): +37 MB in 12 min

This is very cool! I want to steal this feature from
bup/backshift/Low-Bandwidth-File-System and add it to Tahoe-LAFS...


> Then I did some modifications:
>  - Rename a directory of small files (1000 files of 1 MB)
>  - Rename a 500 MB file
>  - Modify a 500 MB file (delete a line at the middle of the file)
>  - Modify a 500 MB file (modify a line at the middle of the file)
>  - Duplicate a 500 MB file
>  - Duplicate a 500 MB file and modify a line at the middle
>
> First backup: 24 GB in 111 min
> Second backup (after modifications): +1751 MB in 118 min
> Restore: 68 min


>> The second-biggest performance win will be when you upload a file (by "tahoe backup" or otherwise) and after transferring the file contents to your Tahoe-LAFS gateway, and after the gateway encrypts and hashes it, the gateway talks to the storage servers and finds out that a copy of this identical file has already been uploaded, so it does not need to transfer the contents to the storage server.

> See my benchmark above. I seem to have no performance win at all?
> But it has correctly detected the deduplicated file, so what could have gone wrong?

I think what's going on here is that it takes just as long to transfer
the file from the tahoe client to the gateway and let the gateway
discover that it is a duplicate as it does to transfer the file from
the client to the gateway and then from the gateway to the storage
server for storage, since they are all on the same machine.

So you can't see any performance improvement due to the deduplication
when the gateway and the storage server are on the same computer.
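To make that cost structure concrete, here is a toy sketch of the duplicate check (all names invented — `storage_index_for`, `has_shares`, `store` — this is not Tahoe-LAFS's real API, just the shape of the idea):

```python
import hashlib

def storage_index_for(ciphertext: bytes) -> bytes:
    # With convergent encryption, identical plaintexts yield identical
    # ciphertexts, so a hash of the ciphertext identifies the file.
    return hashlib.sha256(ciphertext).digest()[:16]

def upload(ciphertext: bytes, servers) -> bytes:
    """Upload ciphertext unless some server already holds it."""
    index = storage_index_for(ciphertext)
    if any(s.has_shares(index) for s in servers):
        return index          # duplicate detected: skip the transfer
    for s in servers:
        s.store(index, ciphertext)
    return index
```

The saving is entirely in the skipped `store` step. When the storage server is on the same machine as the gateway, that step is about as cheap as the client→gateway copy that happened anyway, so detecting the duplicate buys you almost nothing.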

If you want to see a performance improvement due to the deduplication,
then keep the gateway on the same computer with the client, but move
the storage server far away, e.g. to the free Tahoe-LAFS-on-S3 service
that we sent you email about. ;-)

As well as exposing the performance differential by making the
gateway↔storage-server connection much slower, this would also be a
more realistic benchmark, since having your storage off-site is a
useful thing to do in real usage.

>> The biggest performance *lose* will be when you've made a small change to a large file, such as a minor modification to a 2 GB virtual machine image, and tahoe re-uploads the entire 2 GB file instead of computing a delta. :-)
>
> That I can confirm ;-)

Thanks. :-) I'd really like to experiment with combining backshift's
variable-length, content-determined chunking and LZMA compression with
Tahoe-LAFS's storage.
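For the curious, the core of that chunking idea fits in a few lines of Python. This is a toy rolling-hash splitter, not backshift's or bup's actual algorithm; the window size, hash, and boundary mask are all made up for illustration:

```python
WINDOW = 64           # bytes in the rolling-hash window
MASK = (1 << 13) - 1  # boundary when low 13 bits set: ~8 KiB avg chunks
MOD = 1 << 32

def chunks(data: bytes):
    """Split data at positions where a polynomial rolling hash of the
    last WINDOW bytes matches a fixed bit pattern.  Because boundaries
    depend only on nearby content, inserting a byte only changes the
    chunks around the edit; everything else re-chunks identically."""
    shift_out = pow(31, WINDOW, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    window = []
    for i, b in enumerate(data):
        window.append(b)
        h = (h * 31 + b) % MOD
        if len(window) > WINDOW:
            old = window.pop(0)
            h = (h - old * shift_out) % MOD
        if (h & MASK) == MASK and i + 1 - start >= WINDOW:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]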

Regards,

Zooko


