[tahoe-dev] Rates of file duplication

Jeremy Fitzhardinge jeremy at goop.org
Tue Sep 2 15:31:24 UTC 2008

I ended up writing a couple of Perl scripts to generate file content
profiles, and compared a few of my machines.  The amount of sharing is
much lower than I expected, and confirms your 1% number pretty well.
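
As an illustration of what a file content profile could look like, here
is a minimal Python sketch. It is not the Perl tooling mentioned above:
the function names (file_digest, build_profile), the use of SHA-256, and
the digest-to-size layout are all assumptions made for the example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's content, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_profile(root):
    """Map content digest -> size in bytes for each unique file under root.

    Hypothetical profile format: files with identical content collapse
    into a single entry, so len(profile) is the number of unique files.
    """
    profile = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                digest = file_digest(path)
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable file; leave it out of the profile
            profile[digest] = size
    return profile
```

Calling build_profile("/home") on each machine would give one dictionary
per machine, which could then be saved and compared offline.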

I tried it on three machines:

    * lurch: 32-bit Fedora 9 server, 1588837 unique files
    * ezr: 32-bit Fedora 9 laptop, 687124 unique files
    * minilith: 64-bit Fedora 9 desktop, 1014310 unique files

All three are up to date, and all have moderately large chunks of
my user data copied onto them.

Comparing the two 32-bit F9 machines, which I would have thought would
be the most similar, I get around 42 Gbytes (16%) of savings:
42019430400/263635070976 duplicate bytes, 15.9384827839636%
179728/2275961 duplicate files, 7.89679612260491%

Comparing all three, there's about 59 Gbytes of savings, which brings it down to about 14%:
59280711680/420225855488 duplicate bytes, 14.1068691766142%
504620/3290271 duplicate files, 15.3367306218849%
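
For reference, here is a rough Python sketch of how figures in this form
could be computed from per-machine profiles. Everything in it is an
assumption made for illustration: the profile format (a dict mapping a
content digest to a size in bytes), the function names, and the
definition of "duplicate" as every copy beyond the first across all the
machines. The actual scripts may count differently.

```python
def duplication_stats(profiles):
    """Return (dup_files, total_files, dup_bytes, total_bytes).

    profiles: list of dicts, one per machine, mapping a content digest
    to the file size in bytes (hypothetical format).
    """
    total_files = sum(len(p) for p in profiles)
    total_bytes = sum(sum(p.values()) for p in profiles)

    # Union of all profiles: each distinct content is kept exactly once.
    merged = {}
    for p in profiles:
        merged.update(p)

    # Anything beyond one copy per distinct content counts as a duplicate.
    dup_files = total_files - len(merged)
    dup_bytes = total_bytes - sum(merged.values())
    return dup_files, total_files, dup_bytes, total_bytes


def percent(part, whole):
    """Percentage of whole represented by part; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def report(profiles):
    """Format the statistics in the same style as the figures above."""
    dup_files, total_files, dup_bytes, total_bytes = duplication_stats(profiles)
    return "\n".join([
        f"{dup_bytes}/{total_bytes} duplicate bytes, "
        f"{percent(dup_bytes, total_bytes)}%",
        f"{dup_files}/{total_files} duplicate files, "
        f"{percent(dup_files, total_files)}%",
    ])
```

Note that the file totals above are the sums of the per-machine unique
file counts (1588837 + 687124 = 2275961 for the two 32-bit machines),
which is why the sketch sums len(p) over the profiles for its
denominator.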

I've put my tools and profile files up at http://www.goop.org/~jeremy/dups/

