[tahoe-dev] Uploading huge files

Brian Warner warner at lothar.com
Tue Jan 25 20:25:30 UTC 2011


On 1/25/11 12:40 AM, Michael Coppola wrote:
> Hey devs,
> 
> I seem to be having lots of trouble uploading large (as in, 8gb) files
> to my Tahoe-LAFS network through the FTP interface. I set Filezilla to
> its highest timeout, 9999 seconds, and it still hasn't finished
> processing the file by the end so my client times out and tries to
> re-upload it again. The file has not made it on the storage nodes yet.
> What would be the best way to transfer an 8gb file to the network?
> Thanks

Depending upon how your network is laid out, you might consider using a
"Helper". We developed the Helper to assist AllMyData customers avoid
the expansion penalty over their slow upstream DSL line by uploading
non-expanded (1x instead of 3.3x) ciphertext to a Helper node that sits
in the same colo as all the storage servers. Once the Helper has the
data, it does the encoding/expansion step and uploads the resulting
shares to the (now nearby) servers where bandwidth is plentiful.

The reason this might help deal with large files is that the
client-to-Helper connection has code to resume partial uploads. The idea
was that a client might lose their internet connection partway through a
transfer, and we didn't want them to lose all the progress they made.

 client------>Helper------>Storage Servers

The overall flow of control looks like:
 1: client hashes the file, computes encryption key and storage index
 2: client sends storage index to Helper, asks if the file is already in
    the grid, if "yes" then upload is finished, client returns success
 3: Helper asks client for remaining encrypted data, stores it locally
  3a: client encrypts next chunk of data, sends to Helper, loops
 4: when Helper has full ciphertext, it starts encoding+pushing shares
 5: when encode+push is finished, Helper tells client it's done
 6: client returns success
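
In rough Python pseudocode (the names below are illustrative stand-ins,
not Tahoe's actual internal API; in the real protocol the Helper pulls
the ciphertext from the client rather than having it pushed, but the
effect is the same), the client side of that exchange looks like:

    # hypothetical sketch, not real Tahoe-LAFS code
    def helper_assisted_upload(path, helper):
        key, storage_index = hash_and_derive(path)       # step 1
        if helper.already_in_grid(storage_index):        # step 2
            return "success"
        session = helper.start_upload(storage_index)
        with open(path, "rb") as f:                      # steps 3/3a
            while True:
                chunk = f.read(1024 * 1024)
                if not chunk:
                    break
                session.send_ciphertext(encrypt(chunk, key))
        session.wait_for_encode_and_push()               # steps 4 and 5
        return "success"                                 # step 6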

If the client is rebooted during the 3/3a loop, the client will re-hash
the file but then the Helper will recognize the file as partially
transferred, and will resume from where it left off instead of starting
from scratch. This transfer step should take about the same amount of
time as uploading the file with any other tool (FTP, HTTP POST, etc).
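
The resume itself amounts to asking the Helper how much ciphertext it
already holds and skipping past that many bytes of the file (again a
hypothetical sketch with made-up names, not the real interface):

    def resume_upload(path, helper):
        key, storage_index = hash_and_derive(path)      # re-hash after the reboot
        session = helper.start_upload(storage_index)
        offset = session.ciphertext_bytes_held          # e.g. 5 GB of the 8 GB
        cipher = make_cipher(key, starting_at=offset)   # stream cipher, restartable mid-file
        with open(path, "rb") as f:
            f.seek(offset)                              # only the remainder is re-sent
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                session.send_ciphertext(cipher.encrypt(chunk))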

Step 4 does almost the same work as a normal Tahoe upload, but without
the encryption (that was already done on the client), and the Helper
probably has very fast local connections to the storage servers (so
the share-push should be fast). It's limited by CPU speed and protocol
overhead, but we've seen speeds around 5 MB/s (40 Mbps) on typical
gigabit networks.
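
For the 8 GB file in question, that back-end step works out to
something like this (using the ~5 MB/s figure above, which will vary
with your hardware):

    file_size_mb = 8 * 1024                 # 8 GB
    encode_push_rate_mbps = 5.0             # MB/s, observed on a gigabit LAN
    minutes = file_size_mb / encode_push_rate_mbps / 60
    print("encode+push: about %d minutes" % minutes)   # ~27 minutes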

Also, step 4 continues even if the client disconnects. So you might
start an upload, finish transferring the ciphertext, then disconnect.
Tomorrow, when you try again, step 2 will tell you that the file is
already in the grid, and the rest of the process will be skipped.

If you have plenty of bandwidth from the client to the storage servers,
then the Helper won't make the overall upload any faster (in fact it
will make it slower, because step 3 and step 4 are not done in parallel
as they would be for a direct upload). But if your Helper has better
uptime than your client, then a Helper-assisted upload may be more
likely to succeed than a direct one. You want the Helper to be as close
as possible to your storage servers.
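
If you want to try it, the setup is small (assuming the current
tahoe.cfg option names; check docs/configuration.txt for your
version). On the Helper node:

    [helper]
    enabled = true

which makes that node write its FURL into private/helper.furl. Then on
the client node, paste that FURL into:

    [client]
    # helper.furl is the FURL copied from the Helper node:
    helper.furl = pb://...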

Now, that doesn't help the FTP interface meet Filezilla's timeout any
better, but it does mean that the second attempt doesn't have to start
again from scratch, so eventually it ought to succeed.

The FTP case is particularly slow, because the data gets stashed in
multiple temporary places. First your FTP client sends a copy of the
plaintext to the Tahoe FTP server, where it's stored in /tmp/ or ~/tmp/
or something. Then the Tahoe upload process gets to look at the file,
and if you're using a Helper, it will send an encrypted copy to the
Helper, which stores it until upload has finished. I don't know if our
FTP server terminates the upload if the client disconnects in the
middle; I suspect it keeps going. I also don't know whether the SFTP
interface behaves differently.

The way to tell what's going on is to look for high CPU usage on your
client node. You may have success by starting the upload once, letting
it time out, waiting for the client to keep working on the
not-actually-cancelled upload until it completes, and then starting a
second upload (which should hopefully notice that the previous one
completed and use its results).

(Tahoe doesn't necessarily deal well with two simultaneous uploads of
the same file: it's meant to have the second upload wait for the first
one to complete, but I think there are some bugs in there).


hope that helps,
 -Brian


