[tahoe-dev] Storing large trees on the grid

Brian Warner warner-tahoe at allmydata.com
Thu Jan 29 03:02:54 UTC 2009


On Tue, 27 Jan 2009 18:35:04 -0800
Benjamin Jansen <tahoe at w007.org> wrote:

> I executed "tahoe cp -r -v . bupc:backuppc" a while ago... probably  
> close to a week. After days and about 1.3M lines of "examining N of N"  
> output, it said:
> 
> attaching sources to targets, 0 files / 1 dirs in root
> targets assigned, 160300 dirs, 1058696 files
> starting copy, 1058696 files, 160300 directories

Whee! 160k directories, that's a lot.

So, first off, I'm not proud of "tahoe cp -r". It does the job well enough
for small stuff, but it's not very pleasant to use for large subtrees. Part
of the problem is that it holds a lot of state in memory instead of working
in a more streaming fashion. Another part is that it performs work in a
slightly weird order (it creates all the directories first, then starts
working on files). A third issue is that the feedback it provides is not as
useful as it ought to be. And a fourth is that it cannot be restarted
without losing a lot of ground.

The long initial delay was probably the creation of 160k directories, each of
which involves the generation of an RSA keypair, which (depending upon the
machine) can take up to a few seconds of intense CPU time. At two seconds
each, that's 3.7 days. Fortunately, when you run it a second time, it won't
need to create all those directories. Unfortunately, when you run it a second
time, it *will* need to retrieve all those directories (to prove to itself
that it doesn't need to create them).
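
The arithmetic, for the curious (treating the two-seconds-per-keypair figure
as a pure guess, not a measurement):

  # Rough estimate of the RSA-keypair cost for creating 160k directories.
  dirs = 160300
  seconds_per_keypair = 2.0          # assumed, varies a lot by machine
  total = dirs * seconds_per_keypair
  print("%.1f days of keypair generation" % (total / 86400))   # ~3.7 days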

(incidentally, when we finish our planned DSA-based mutable files, directory
creation will no longer involve a CPU-intensive prime-number search, and
creating a directory should be faster than uploading a small immutable file,
probably about 2*RTT)

> Right now, it claims that it has copied about 50K files and 7700  
> directories. If things keep going as they are, that means I have about  
> 5 months remaining. I'd rather not wait that long. :) I have a  
> synchronous 15Mbit internet connection; most of the time, when I watch  
> a graph of traffic at my router, it's sitting at < 5KB/sec out. So,  
> the bottleneck is definitely not my connection.

Yeah. There are a number of places that can slow down the process. With the
current max_segment_size of 128KiB, we see colo-to-colo uploads running at
about 1-2MBps. (A year ago, when the segment size was 1MiB, they ran at
closer to 5MBps, so there's clearly a per-segment overhead cost that we
haven't really investigated yet.)

We aren't pipelining the upload process, so round-trip time and server
latency directly affect the upload speed. The upload is limited by the
slowest server, so even one heavily-loaded storage server will drag down the
overall upload speed.
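
To see why that hurts, here's a crude model (the numbers are invented for
illustration, and FEC expansion is ignored): with no pipelining, each segment
pays roughly one round trip to the slowest server on top of the raw transfer
time.

  # Crude non-pipelined upload model: each 128KiB segment waits for the
  # slowest server before the next one goes out.  Illustrative numbers only,
  # and the 3.3x expansion from erasure coding is ignored.
  segment_size = 128 * 1024        # bytes (current max_segment_size)
  file_size = 10 * 1024 * 1024     # say, a 10MB file
  rtt = 0.10                       # 100ms round trip to the slowest server
  link_rate = 15e6 / 8             # 15Mbps uplink, in bytes/sec

  segments = -(-file_size // segment_size)   # ceiling division
  transfer = file_size / link_rate           # time actually on the wire
  stalls = segments * rtt                    # per-segment waiting
  print("wire: %.1fs  stalls: %.1fs" % (transfer, stalls))   # 5.6s vs 8.0s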

In addition, since "tahoe cp" speaks HTTP to a Tahoe node's webapi port, the
files that you're backing up take a tortuous route to the servers. Assuming
that your Tahoe node is running on the same machine that you're invoking
"tahoe cp" on, the path looks like this (a stripped-down sketch of the HTTP
leg follows the list):

1: "tahoe cp" reads the file off disk, writes then to an HTTP connection
2: the HTTP connection goes through the loopback socket
3: the tahoe node reads the file from the HTTP connection
    if the file is larger than 100KB, it gets written to /tmp, else stays in RAM
4: the tahoe node reads the file once, to hash it and calculate the AES key
5: the tahoe node finds the right storage servers
6: the tahoe node reads the file a second time, hashing+encrypting+encoding
7: the encoded segments are sent to the storage servers, one at a time
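
For reference, steps 1-3 amount to an HTTP PUT to the local node's webapi;
a stripped-down version (assuming the default webapi port of 3456, and
written in today's Python for brevity) looks roughly like:

  # A stripped-down version of what "tahoe cp" does for a single file:
  # read it off disk and PUT it to the local node's webapi, which returns
  # the file's readcap.  (Default webapi port 3456 assumed.)
  import http.client

  def upload_one_file(path):
      with open(path, "rb") as f:
          body = f.read()        # buffered here for brevity
      conn = http.client.HTTPConnection("127.0.0.1", 3456)
      conn.request("PUT", "/uri", body)
      resp = conn.getresponse()
      cap = resp.read().decode("ascii")
      conn.close()
      return cap                 # e.g. URI:CHK:...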

Some day, I'd like a form of "tahoe cp" to instruct the tahoe node to read
the source file directly off of disk, bypassing steps 1, 2, and 3.

In addition, we're thinking about adding an option to disable convergence
and upload each file with a random encryption key; this would bypass
step 4.
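
For the curious: the convergence in step 4 is, conceptually, just deriving
the AES key from a hash of the plaintext plus a per-client secret, so
identical files encrypt identically and dedupe on the grid. The sketch below
is an illustration of the idea only, not Tahoe's actual key-derivation code:

  # Conceptual only -- not Tahoe's real derivation (which uses tagged hashes).
  import hashlib, os

  def convergent_key(path, convergence_secret):
      # Key depends on the file's contents, so identical plaintexts
      # produce identical ciphertexts -- at the cost of an extra read pass.
      h = hashlib.sha256(convergence_secret)
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 16), b""):
              h.update(chunk)
      return h.digest()[:16]     # 128-bit AES key

  def random_key():
      return os.urandom(16)      # no extra read pass, but no dedup either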

> Based on my understanding of BackupPC's backend storage, most of those  
> million files are hard links. Knowing what BPC is backing up, I'd say  
> 10-20% are unique files. Does "tahoe cp" recognize hard links and copy  
> them as such?

Effectively, yes, but not in a particularly efficient way. "tahoe cp" doesn't
pay attention to the inode number at all (which would be necessary to
discover that two files are actually the same object). What happens is that
we upload the first file normally; then, for the second link (which looks
like a regular file with identical contents), we go through steps 1+2+3+4,
wind up with the same AES key as we got the first time, discover in step 5
that the storage servers already have shares for that file, and bypass the
step-7 uploading process. (We still have to hash+encrypt+encode, to generate
the rest of the file readcap, but we discard the results.)

So the second upload of the file consumes all the CPU, but none of the
bandwidth. For small files this is pretty cheap; for large ones it's cheaper
than a new upload but still much slower than I'd like.

We've been talking about a "tahoe backup" command that would be more clever
about avoiding uploads of previously-uploaded files (skipping all 7 steps);
please see ticket #598 for details. Just today, we've started talking about
what it would take to efficiently back up local filesystems that use
hardlinks for shared files (like BackupPC and Time Machine). We don't have
any answers yet, but I imagine that the "backupdb" mentioned in #598 (and
#597) could include a table that maps from (devno, inodeno) to filecap,
which would let us apply the same "have I uploaded this before" speedup to
hardlinks that we apply to files that simply haven't changed since the
previous backup.
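
To make that concrete, a hypothetical hardlink table (just a sketch of the
idea, not the actual #598 schema) might look like:

  # Hypothetical backupdb hardlink table: maps (device, inode) to a
  # previously-obtained filecap, so a second hardlink to the same file
  # skips the upload entirely.  Not the real #598/#597 design.
  import os, sqlite3

  db = sqlite3.connect("backupdb.sqlite")
  db.execute("CREATE TABLE IF NOT EXISTS hardlinks"
             " (devno INTEGER, inodeno INTEGER, filecap TEXT,"
             "  PRIMARY KEY (devno, inodeno))")

  def filecap_for(path, upload):
      s = os.stat(path)
      row = db.execute("SELECT filecap FROM hardlinks"
                       " WHERE devno=? AND inodeno=?",
                       (s.st_dev, s.st_ino)).fetchone()
      if row:
          return row[0]          # already uploaded via another link
      cap = upload(path)         # e.g. the webapi PUT sketched earlier
      db.execute("INSERT INTO hardlinks VALUES (?,?,?)",
                 (s.st_dev, s.st_ino, cap))
      db.commit()
      return cap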

> I thought about uploading a tarball instead of each file. The nature  
> of what I'm storing makes it unlikely that I would want to access an  
> individual file, anyway. However, my understanding is that tahoe  
> currently cannot store a 156GB file. Is that correct?

... yeah, I wouldn't bet on that working. I believe that current Tahoe trunk
should be theoretically capable of it: that is, I believe we've removed the
last of the 32-bit size fields that were limiting us to file sizes in the
12GiB range (the 64-bit fields we're using now should theoretically push us
into the exabyte range). However, I also believe that there are other
algorithmic and memory-footprint reasons why an upload of that size would
never complete. I successfully uploaded a 15GB file to a test grid (although
it blew out my /tmp filesystem at first, and took several hours to compute
the hash tree before it even contacted the storage servers), but my attempt
to download the same file made no progress for upwards of a day, and I
finally gave up on it.

Increasing the max_segment_size value to, say, 4MiB or 16MiB might reduce the
hash-tree overhead considerably. I didn't try that.
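
To give a sense of the scale involved, here's a back-of-the-envelope sketch
(32-byte hashes assumed, and the per-share trees are ignored, so this
understates the real overhead):

  # How segment size affects segment count and the rough size of a binary
  # hash tree over those segments, for a 156GB file.
  KiB, MiB, GB = 1024, 1024 * 1024, 10**9
  file_size = 156 * GB

  for seg in (128 * KiB, 4 * MiB, 16 * MiB):
      segments = -(-file_size // seg)      # ceiling division
      leaves = 1
      while leaves < segments:             # round up to a power of two
          leaves *= 2
      tree_bytes = (2 * leaves - 1) * 32   # nodes in a full binary tree
      print("segsize %8d: %9d segments, ~%.1f MB of hash tree"
            % (seg, segments, tree_bytes / 1e6))
  # 128KiB -> ~1.2M segments and ~134MB of tree; 16MiB -> ~9300 and ~1MB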

Plus, given the chaotic nature of TCP connections, I'd be nervous that the
storage-server connections might flap before the upload completed, and we
don't yet have the ability to survive that (the uploader gives up on that
share, and once you give up on more than 3 shares, the upload is abandoned).


Basically, there's per-object overhead in dealing with 160k directories and
1M files, which makes that approach tough, and there are also problems in
dealing with one humongous file. If you had a way to break it up into files
of a few GB each, that would probably work, but it may make it awfully hard
to extract the data you want afterwards (i.e. don't use /bin/split).
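
If you do go that route, something like one tarball per top-level
subdirectory keeps each piece independently extractable; a rough sketch
(assuming your subdirectories happen to come out at a few GB apiece):

  # Sketch: one tarball per top-level subdirectory, so each uploaded piece
  # can be fetched and unpacked on its own (unlike /bin/split output).
  import os, tarfile

  def tar_per_subdir(root, outdir):
      for name in sorted(os.listdir(root)):
          src = os.path.join(root, name)
          if not os.path.isdir(src):
              continue
          out = os.path.join(outdir, name + ".tar")
          with tarfile.open(out, "w") as tar:
              tar.add(src, arcname=name)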

> I'd appreciate any advice on how I can speed this up - I'd like for my line
> to be the bottleneck. ;)

Yeah, I'd like that too :). 15Mbps is pretty fast, though; I don't know if
we're going to be able to saturate that without rewriting everything in C.


cheers,
 -Brian
