[tahoe-dev] [tahoe-lafs] #1107: "sneakernet" servers

tahoe-lafs trac at tahoe-lafs.org
Sat Jul 3 21:22:00 UTC 2010

#1107: "sneakernet" servers
 Reporter:  warner        |           Owner:           
     Type:  enhancement   |          Status:  new      
 Priority:  major         |       Milestone:  undecided
Component:  code-storage  |         Version:  1.7.0    
 Keywords:                |   Launchpad Bug:           
 "Never underestimate the bandwidth of a station wagon filled with
  9-track tapes."

 Zandr and I were cooking up a high-volume low-bandwidth backup scheme
 the other day, to manage our large digital-photo libraries (after a
 shoot, we'll typically add 4-8GB of image files to the library, and
 update 10-100MB of metadata DB files). A lot of these changes are
 append-only. Uploading this much data over just a DSL line can take days
 or weeks. Using only the network is convenient but somewhat painful.

 We were sketching out a Git-based scheme, since all file-synchronization
 problems are really version-control problems, and because Git has some
 tools to create "packfiles" which contain a compressed form of all the
 data needed to get from version A to version B. The idea was to then put
 these packfiles on portable drives, and carry them from one machine to
 another, and then let a process on the receiving machine incorporate the
 packfile into the second copy of the archive.

 But, it'd be nice if you could use Tahoe for this. The simplest use-case
 would be a backup grid that has just one server node, and k=N, so you're
 uploading all the shares to the same place. Imagine that the server is
 at your office or some other place that you visit every day, and that
 the two machines have network connectivity but it's relatively slow
 compared to the amount of data you want to back up.

 Then you'd configure the client with a local directory that gets
 associated with the remote server. That local directory would actually
 live on a removable drive. When the client creates a share to send to
 that server, it actually just writes it to the drive. A separate process
 slowly uploads shares from the drive to the server, using whatever
 bandwidth is available, removing them from the drive when it finishes,
 so if you just wait long enough, you'll get the same share distribution
 as without this change.
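
 A rough sketch of that "separate process", for illustration only. The
 function name drain_spool and the send_share callback are hypothetical
 and not part of Tahoe; send_share stands in for whatever actually pushes
 one share file to the server and reports whether it was accepted.

```python
import os
from typing import Callable, List


def drain_spool(spool_dir: str, send_share: Callable[[str], bool]) -> List[str]:
    """Upload each share file found in spool_dir, deleting it once sent.

    send_share is a hypothetical callback: it receives a file path and
    returns True only if the remote server accepted the share.
    """
    delivered = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue  # ignore subdirectories and other non-file entries
        if send_share(path):
            os.remove(path)  # only remove after a successful upload
            delivered.append(name)
    return delivered
```

 Shares that fail to upload stay on the drive, so they either get retried
 on the next pass or travel with the drive.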

 But, when you leave for work in the morning, you unplug the drive and
 bring it along with you. When you arrive at the office, you plug the
 drive into the server machine, which notices it and starts copying
 shares off, deleting them as it goes. At the end of the day, you unplug
 the drive and bring it home, to repeat the process. A cheap 8GB flash
 drive used this way will achieve an average throughput of 740kbps, which
 is better upstream bandwidth than most high-end DSL lines, and a cheap
 100GB external HD swapped daily provides roughly 9Mbps.
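
 The throughput figures are just drive capacity divided by the length of
 a day. A small worked example, assuming decimal gigabytes, a completely
 full drive, and one one-way trip per 24 hours (the function name is made
 up for this sketch):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sneakernet_bps(capacity_bytes: float, trips_per_day: float = 1.0) -> float:
    """Average one-way throughput, in bits per second, of a carried drive."""
    return capacity_bytes * 8 * trips_per_day / SECONDS_PER_DAY


print(round(sneakernet_bps(8e9) / 1e3))       # 741  (kbps, 8GB flash drive)
print(round(sneakernet_bps(100e9) / 1e6, 1))  # 9.3  (Mbps, 100GB external HD)
```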

 The drive behaves like a "mail bag" that always moves back and forth,
 carrying as much data as can fit, or sometimes being empty when there's
 no work to be done. It's a layer-4 protocol, using humans for transport,
 and large removable drives as packets, with transmission control managed
 by the computers at either end.

 The client would want some tools to store shares locally, if a backup
 occurred while the removable drive was elsewhere, and then detect it
 coming back and move the shares onto it. It might be a good idea to hold
 a copy of the shares locally until the remote server has confirmed
 receipt (via a set of DYHB queries over the network), and to tolerate
 drive failures by re-sending the shares once a failure is detected.
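
 One way the client-side bookkeeping might look, as a sketch only. The
 class and method names are hypothetical, and server_has is a stand-in
 for a DYHB query answered by the remote server; none of this is existing
 Tahoe code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PendingShares:
    """Tracks shares whose local copy is kept until the server has them."""

    # share id -> label of the drive carrying it, or None if waiting locally
    location: Dict[str, Optional[str]] = field(default_factory=dict)

    def add(self, share_id: str) -> None:
        """Record a newly created share that has not reached any drive yet."""
        self.location[share_id] = None

    def load_onto(self, drive: str) -> List[str]:
        """Mark every waiting share as copied onto the given drive."""
        moved = [s for s, d in self.location.items() if d is None]
        for s in moved:
            self.location[s] = drive
        return moved

    def confirm(self, server_has: Callable[[str], bool]) -> List[str]:
        """Forget shares the server reports holding; local copies can go."""
        done = [s for s in self.location if server_has(s)]
        for s in done:
            del self.location[s]
        return done

    def drive_failed(self, drive: str) -> List[str]:
        """Re-queue unconfirmed shares that were on a lost or broken drive."""
        lost = [s for s, d in self.location.items() if d == drive]
        for s in lost:
            self.location[s] = None
        return lost
```

 Anything still listed in location has a local copy somewhere, so a dead
 drive costs a re-send rather than the data.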

 It might also be nice to take advantage of multiple removable drives,
 and to allow bidirectional exchanges, or a rotating-ring system among
 your coworkers' houses (sneakernet-augmented friendnet: each morning you
 drop a USB stick on Bob's desk, and Alice drops one off on yours, and
 with 5 participants, all data would get to everyone else's machines
 within a week).
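
 The "within a week" estimate can be checked with a toy simulation. This
 assumes a one-directional ring, one hand-off per person per morning, and
 sticks big enough to carry everything the sender holds; all of those are
 simplifications made for this sketch.

```python
def days_until_synced(participants: int) -> int:
    """Count the mornings needed before everyone holds everyone's data."""
    # holdings[i] is the set of people whose data person i currently has
    holdings = [{i} for i in range(participants)]
    days = 0
    while any(len(h) < participants for h in holdings):
        # Snapshot first, so a stick carries only what its sender held
        # before this morning's exchanges.
        sticks = [set(h) for h in holdings]
        for i, stick in enumerate(sticks):
            holdings[(i + 1) % participants] |= stick
        days += 1
    return days


print(days_until_synced(5))  # 4
```

 Under those assumptions five participants converge in four working
 days, which leaves a day of slack within the week.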

 For extra credit, put LEDs next to the USB port to tell you when it's
 safe/useful to transfer the drive. Some USB flash drives have an e-ink
 display to show how full they are: hack the display to show a label that
 tells you where the drive needs to go next.

 The configuration for this should make it easy to keep sending small
 shares over the network, and only use the sneakernet carriers for large
 files: this probably means measuring/estimating the upload bandwidth and
 showing the user the upload queue, measured in units of time, and
 sorting the small shares to the front of the list. It would also mean
 changing the "file has been uploaded" completion semantics, since until
 the shares have finished migrating to their remote homes, the file would
 remain vulnerable to a failure of the local host.
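
 A minimal sketch of the queue described above: sort shares smallest
 first, express the queue in seconds of upload time, and leave anything
 that would not finish within a chosen time budget for the sneakernet
 carrier. The function name, the time-budget parameter, and the measured
 upload_bps value are all assumptions for illustration.

```python
from typing import Dict, List, Tuple


def plan_uploads(
    share_sizes: Dict[str, int],
    upload_bps: float,
    max_wait_seconds: float,
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Split shares into a network queue and a sneakernet list.

    Returns (network, carrier): network holds (share, finish_time_seconds)
    pairs in upload order, and carrier holds the shares left for the drive.
    """
    network: List[Tuple[str, float]] = []
    carrier: List[str] = []
    elapsed = 0.0
    # Smallest shares go first so they are not stuck behind large ones.
    for name, size in sorted(share_sizes.items(), key=lambda kv: kv[1]):
        cost = size * 8 / upload_bps  # seconds to push this share
        if elapsed + cost <= max_wait_seconds:
            elapsed += cost
            network.append((name, elapsed))
        else:
            carrier.append(name)
    return network, carrier
```

 For example, with upload_bps=8000 (1000 bytes per second) and a budget
 of 10 seconds, shares of 500 and 1000 bytes go over the network and a
 10MB share waits for the drive.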

Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1107>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized file storage grid
