[tahoe-dev] Random access to files

Brian Warner warner-tahoe at allmydata.com
Tue Jan 27 22:35:09 UTC 2009


On Tue, 27 Jan 2009 06:59:49 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:

> How hard would it be to add support for random access?

Not impossible, but not trivial. We specifically designed the share format to
support low-latency random access, but we didn't implement it in the
downloader. Since then, we've added support for certain forms of random
access, but it isn't complete.

> 3.  Modify the IDownloader interface to allow specification of a byte range.

FYI, IFileNode.read() already has this API: it takes a consumer, an offset,
and a size. That's independent of what IDownloader provides, though.
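
To make that concrete, a rough sketch of driving read() with a
Twisted-style consumer (SpanCollector and read_span are made-up names
for illustration, not anything in Tahoe):

  class SpanCollector:
      """Bare-bones consumer that just accumulates the bytes it is given."""
      def __init__(self):
          self.chunks = []
      def registerProducer(self, producer, streaming):
          pass  # we never ask for a pause, so the producer can be ignored
      def unregisterProducer(self):
          pass
      def write(self, data):
          self.chunks.append(data)

  def read_span(filenode, offset, size):
      # 'filenode' is assumed to be an existing immutable IFileNode
      c = SpanCollector()
      d = filenode.read(c, offset, size)  # fires when the range is delivered
      d.addCallback(lambda ign: b"".join(c.chunks))
      return d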

> 4.  Modify some client interfaces to allow specification of a byte range.
>     In particular, the webapi GET command should notice and honor the
>     standard HTTP Content-Range header.

In current Tahoe trunk, a webapi GET call with a "Range:" header (the
request-side counterpart of Content-Range) will work as expected (for both
mutable and immutable files). However, the latency (what we refer to as
"alacrity") for an immutable file will be equal to the time it takes the
webapi node to retrieve all bytes up to and including the end of the range
from the storage servers (rounded up to the end of a segment). This works
by downloading the file to local disk, then serving up the requested range
from that copy. Not pretty, but it allows things like the Quicktime media
player to stream movies.
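
For reference, a client-side range request looks something like this (the
web port 3456 and the read-cap are placeholders, and the urllib calls are
just for illustration):

  import urllib.request

  cap = "URI:CHK:..."   # placeholder read-cap
  url = "http://127.0.0.1:3456/uri/" + cap
  req = urllib.request.Request(url,
                               headers={"Range": "bytes=1000000-1999999"})
  resp = urllib.request.urlopen(req)
  print(resp.status)   # expect 206 (Partial Content) when the range is honored
  data = resp.read()   # just the requested bytes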

(note that our SDMF rules let us assume that mutable files are relatively
small, so everything is kept in a single segment, and random access always
grabs the whole file. The rest of this discussion is about immutable files,
and would apply to multi-segment MDMF mutable files as well).

Your code analysis is correct. I know that the AES class didn't always
provide a way to start the counter at something other than zero, but I
believe it does now. The client-side helper code, in
immutable.upload.AssistedUploader or EncryptAnUploadable, is another place
that needs that sort of functionality, since the server-side Helper might
tell it to send ciphertext starting from the middle of the file.
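
The counter arithmetic is the only tricky part: to start decrypting at
byte OFFSET you jump the CTR counter to block OFFSET//16 and throw away
OFFSET%16 bytes of keystream. A sketch of the idea (using a generic
AES-CTR API rather than the class Tahoe actually uses, and assuming the
counter starts at zero for byte zero of the plaintext, which I believe
matches our CHK encryption):

  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def ctr_decryptor_at(key, offset):
      block, skip = divmod(offset, 16)
      initial_counter = block.to_bytes(16, "big")  # 128-bit counter block
      cipher = Cipher(algorithms.AES(key), modes.CTR(initial_counter))
      dec = cipher.decryptor()
      dec.update(b"\x00" * skip)  # discard keystream for the skipped bytes
      return dec  # dec.update(ciphertext[offset:]) now yields plaintext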


The downloader code is currently structured as one big loop (one pass per
segment), whereas I'd like it to be a state machine. In addition to making
random access easier, this would make it easier to tolerate servers which
fail or get stuck, and to prefer faster servers.

I'm not sure how deeply you'd like to get into this. The full project
involves:

 * remove the Downloader class, move a lot of the functionality into
   immutable.FileNode, allow the FileNode to cache its hash tables and
   ReadBucketProxy instances
 * replace IFileNode.download() and download_to_data() with wrappers around
   IFileNode.read(), instead of the other way around
 * create a per-FileNode table of pending reads, each with a consumer, a
   Deferred, an indication of how much has already been written to the
   consumer, and probably a paused flag
 * make read() start by figuring out which segments are involved in the span
   and queue them for retrieval in the state machine (a sketch of this
   span-to-segments step follows this list)
 * figure out which hash tree nodes are required for the currently
   interesting segments, keep track of which hash tree requests have been
   made already, send out new requests if necessary
   * merge adjacent requests to coalesce reads
   * implement immutable-readv(), since currently we only have read()
   * add ReadBucketProxy calls to fetch a subset of hash tree nodes, instead
     of always fetching the whole tree
   * keep track of a list of servers, sorted by how fast they appear to be.
      Bringing up a new server requires establishing a connection and
      fetching some number of share hashes. If a server fails, remove it
      from the list.
     If a server is slow, move it to the end of the list. Prefer the servers
     at the front of the list. The initial peer-selection phase is used to
     figure out which servers have shares at all.
 * keep track of which segment requests have been made already, send out
   new requests if necessary. Make sure we get the hashes first, and make
   sure that we get the segments in order, so we can retire each segment as
   soon as possible
 * when a segment arrives, validate it against the hash tree, decode, decrypt,
   validate the ciphertext hash against its own tree, slice out the span that
   satisfies the pending read, deliver the data, maybe discard the segment
 * each FileNode should be allowed to cache one segment's worth of data, so
   that someone who reads the whole file one byte at a time doesn't cause
   Tahoe to do much more work than someone who reads the whole file at once.
 * remove the hackish local-disk file cache that supported random-access in
   the past
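
To give a flavor of the span-to-segments step mentioned in the list, a
minimal sketch (the names are illustrative; 'segsize' and 'num_segments'
stand for the values the downloader already gets from the URI extension
block):

  def segments_for_span(offset, size, segsize, num_segments):
      if size <= 0:
          return []
      first = offset // segsize
      last = min((offset + size - 1) // segsize, num_segments - 1)
      return list(range(first, last + 1))

  # e.g. with segsize=131072, read(offset=200000, size=100000) touches
  # segments [1, 2]; the caller then slices its span back out of them.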

> I need to make sure I understand how this code works. Everything is a
> callback! It looks to me like the set of segments selected for download is
> defined in the loop in _download_all_segments:
> 
>         for segnum in range(self._vup.num_segments):
>             d.addCallback(self._download_segment, segnum)
>             # this pause, at the end of write, prevents pre-fetch from
>             # happening until the consumer is ready for more data.
>             d.addCallback(self._check_for_pause)
>         return d

Yeah, the reactor (and the event-driven framework in general) makes
everything into a callback.

Reducing the number of segments that get fetched is basically enough to
reduce the amount of work done for a single random-access read. You could
reduce it further by fetching a subset of the hash tree: to validate one
segment, you only need to fetch log2(Nsegs) hashes, but since the current
code assumes it's going to download the whole thing, it fetches all (Nsegs)
of them ahead of time.
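
To make the log2(Nsegs) point concrete: with the block hash tree laid out
as a complete binary tree (root at index 0, children of node i at 2i+1 and
2i+2), validating one leaf only needs the sibling of each node on the path
from that leaf up to the root. A sketch (illustrative, not the hashtree
module's API):

  def needed_hash_nodes(leaf_index, num_leaves):
      # assumes num_leaves has been padded up to a power of two, so the
      # leaves occupy the last num_leaves slots of the array
      i = (num_leaves - 1) + leaf_index
      needed = set()
      while i > 0:
          needed.add(i + 1 if (i % 2) else i - 1)  # the sibling
          i = (i - 1) // 2                         # move up to the parent
      return needed   # roughly log2(num_leaves) node indices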

Note that this basic approach means that two successive random-accesses (even
of the same byte) will take equal amounts of time. The full project described
above would fix this, and would allow random-access to be the primary access
mode (since full-file access could be implemented efficiently in terms of a
sequence of regular reads).

> Actually, it may be better to return the whole sequence of segments and an
> offset to tell the client where its range begins. That way the client can
> cache segments.

Eh, I'd be worried about confusing the client: "I asked for 500-600, and
somehow I got back 400-700". If you go that way, make sure the API makes it
very clear that the caller needs to examine the return value to locate the
data they asked for. Instead, I'd suggest having read(offset,size) or
readv([(off1,size1),(off2,size2)..]) return precisely what was asked for, and
provide a separate get_segsize_hint() which gives the caller a hint that they
can read(offset=N*segsize, size=segsize) faster than, say,
read(offset=N*segsize+1, size=segsize). If the caller wants to cache extra
data, they can ask for it explicitly.
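
If it helps, the shape I have in mind looks something like this, in the
zope.interface style we use elsewhere (the interface name and
get_segsize_hint() are just proposals from this message, not anything that
exists today):

  from zope.interface import Interface

  class IRandomAccessReader(Interface):
      def read(offset, size):
          """Return a Deferred that fires with exactly 'size' bytes,
          starting at 'offset'."""
      def readv(spans):
          """'spans' is a list of (offset, size) tuples. Return a Deferred
          that fires with a list of strings, one per span, each containing
          precisely what was asked for."""
      def get_segsize_hint():
          """Return the segment size, so callers who care about speed can
          align their reads to read(offset=N*segsize, size=segsize)."""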


cheers,
 -Brian


