[tahoe-dev] measure your convergence

zooko zooko at zooko.com
Thu Mar 20 20:38:34 UTC 2008


Ever wondered how much storage space you would save if you and your  
friends coalesced all of your identical files?

Wonder no longer!  Now you can find out!  Install the "dupfilefind"  
utility [*] and run it with command-line arguments like:

dupfilefind --ignore-dirs="," --min-size=32 --profiles

(It probably works on all operating systems.)

It will recursively examine all files reachable from the current  
working directory and spew out a series of "hashcode filesize" pairs,  
where the hashcode is the least significant 8 bits of the adler32  
checksum of the first 8192 bytes of the file.
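
For the curious, here is a rough sketch of how one of those pairs could be computed. It is not dupfilefind's actual source, just my reading of the description above; the names profile and walk_profiles are made up for illustration, and the min_size parameter only assumes that --min-size means "skip files smaller than this many bytes".

```python
import os
import zlib

PREFIX_LEN = 8192  # only this many leading bytes of each file are checksummed


def profile(path):
    """Return (hashcode, filesize) for one file (hypothetical helper)."""
    with open(path, "rb") as f:
        prefix = f.read(PREFIX_LEN)
    # Keep only the least significant 8 bits of the adler32 checksum.
    hashcode = zlib.adler32(prefix) & 0xFF
    return hashcode, os.path.getsize(path)


def walk_profiles(root=".", min_size=32):
    """Yield (hashcode, filesize) for each regular file under root.

    Assumption: files smaller than min_size bytes are skipped.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) < min_size:
                continue
            yield profile(path)
```

With only 8 bits of checksum there are just 256 possible hashcodes, so a pair reveals very little about a file, and two different files of the same size will collide fairly often.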

It will also mention whenever it finds two separate files on your
system which are identical to each other.
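
One way such duplicate detection could work is sketched below. This is an assumption on my part, not a description of dupfilefind's internals: it groups files by size, then compares SHA-256 digests of the full contents (a swapped-in technique), and it uses the device and inode numbers to make sure two paths really are separate files rather than hard links to the same one. The name find_duplicates is hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root="."):
    """Return sorted lists of paths whose contents are byte-for-byte identical."""
    seen = set()                 # (device, inode) of files already counted
    by_size = defaultdict(list)  # filesize -> paths of distinct files
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # a hard link to a file we already have
            seen.add(key)
            by_size[st.st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            # Reads the whole file into memory; fine for a sketch.
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```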

Send the output to your friends, or to me -- zooko at zooko.com -- and  
we'll find out approximately how many of your files are shared with  
other people who submit results.  (Please compress your output with a  
good compressor like 7zip or rzip or bzip2.)
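
Comparing two people's results could look roughly like the sketch below. It assumes each profile line is two whitespace-separated decimal integers, "hashcode filesize"; that layout is my guess at the output format, and parse_profiles and estimate_shared are hypothetical names. Lines that do not fit (such as the duplicate-file notices) are skipped.

```python
from collections import Counter


def parse_profiles(text):
    """Parse 'hashcode filesize' lines into a multiset of (hashcode, filesize)."""
    pairs = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # not a profile line under the assumed format
        try:
            pairs[(int(fields[0]), int(fields[1]))] += 1
        except ValueError:
            continue
    return pairs


def estimate_shared(a, b):
    """Return (matching file count, matching bytes) between two multisets.

    A match means equal hashcode and equal size, so this is only an
    estimate: different files can collide on both.
    """
    shared = a & b  # multiset intersection: minimum count of each pair
    count = sum(shared.values())
    nbytes = sum(size * n for (_hashcode, size), n in shared.items())
    return count, nbytes


mine = parse_profiles("12 100\n12 100\n7 50\n")
yours = parse_profiles("12 100\n9 50\n")
print(estimate_shared(mine, yours))  # prints (1, 100)
```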

You take full responsibility for leaking all this information about
your files -- namely the low 8 bits of the adler32 checksum of each
file's first 8192 bytes, and each file's size.  If duplicate files are
detected on your system, their device and inode numbers are included
as well.



[*] To install the dupfilefind utility, either download this tarball:


untar it, cd into the resulting directory, and run:

python setup.py install

or else install the easy_install tool:


and then run:

easy_install dupfilefind
