[tahoe-dev] Removing the dependency of immutable read caps on UEB computation
shawn at willden.org
Mon Oct 5 00:49:23 UTC 2009
On Sunday 04 October 2009 02:25:53 pm Brian Warner wrote:
> So, one suggestion that follows would be to store the immutable
> "share-update" cap in the same dirnode column that contains writecaps
> for mutable files.
Perhaps. I think re-encoding caps would have a more specialized purpose.
They would be needed by a repair system.
> I suppose it's possible to have a re-encoding cap
> which doesn't also provide the ability to read the file
The re-encoding caps I described would not provide the ability to decrypt the
file, only to re-encode the ciphertext.
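To illustrate the separation (a toy sketch with made-up names, not Tahoe's actual codec or share format): re-encoding consumes only ciphertext, so the decryption key never enters the computation at all.

```python
import hashlib

def reencode_share_hashes(ciphertext: bytes, k: int) -> list:
    """Toy stand-in for re-encoding: split the ciphertext into k
    primary blocks and hash each one. Nothing here needs the AES
    key -- only the opaque ciphertext bytes. (Real Tahoe uses zfec
    erasure coding and Merkle trees instead of this flat scheme.)"""
    # pad so the ciphertext divides evenly into k blocks
    block_len = -(-len(ciphertext) // k)  # ceiling division
    padded = ciphertext.ljust(block_len * k, b"\x00")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
    return [hashlib.sha256(b).digest() for b in blocks]

share_hashes = reencode_share_hashes(b"opaque ciphertext bytes", k=3)
```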
> in which case
> the master cap that lives above both re-encoding- and read- caps could
> be called the read-and-re-encode-cap, or something).
read-and-re-encode-and-verify. 'master' is much shorter :)
> I still don't follow. You could hash+encrypt+FEC, produce shares, hash
> the shares, produce the normal CHK readcap, and then throw away the
> shares (without ever touching the network): this gives you caps for
> files that haven't been uploaded to the grid yet.
But you also have to decide what encoding parameters to use.
I want to separate that decision, because I want to allow encoding decisions
to be made based on reliability requirements, performance issues, grid size
and perhaps even server reliability estimates. Many of those factors are
only known at the point of upload.
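A toy sketch of the offline flow Brian describes makes the dependency concrete (stand-in "encryption" and "FEC", invented cap format -- nothing below matches Tahoe's real formats): because k and N feed into the share hashes, the resulting cap changes whenever the encoding decision changes.

```python
import hashlib

def toy_readcap(plaintext: bytes, k: int, n: int) -> str:
    """Hedged sketch of hash + encrypt + FEC + hash-the-shares.
    The point: the encoding parameters k and n are *inputs*, so
    the cap depends on a decision made at (or before) upload."""
    key = hashlib.sha256(b"convergence" + plaintext).digest()  # CHK-style key
    ciphertext = bytes(b ^ key[i % 32] for i, b in enumerate(plaintext))
    # toy "FEC": k primary blocks, replicated up to n shares
    block_len = -(-len(ciphertext) // k)
    blocks = [ciphertext[i * block_len:(i + 1) * block_len].ljust(block_len, b"\x00")
              for i in range(k)]
    shares = [blocks[i % k] for i in range(n)]
    root = hashlib.sha256(b"".join(hashlib.sha256(s).digest()
                                   for s in shares)).hexdigest()
    return "URI:TOY:%s:%s:%d:%d" % (key.hex()[:16], root[:16], k, n)

# same file, different encoding decision -> different cap
cap_3_of_10 = toy_readcap(b"some file contents", 3, 10)
cap_4_of_10 = toy_readcap(b"some file contents", 4, 10)
```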
> Hm, we're assuming a model in which the full file is available to some
> process A, and that there is a Tahoe webapi-serving node running in
> process B, and that A and B communicate, right? So part of the goal is
> to reduce the amount of data that goes between A and B? Or to make it
> possible for A to do more stuff without needing to send a lot of data to
> node B?
> In that case, I'm not sure I see as much of an improvement as you do. A
> has to provide B with a significant amount of uncommon data about the
> file to compute the FEC-less readcap: A must encrypt the file with the
> right key, segment it correctly (and the segment size must be a multiple
> of 'k'), build the merkle tree, and then deliver both the flat hashes
> and the whole merkle tree. This makes it sound like there's a
> considerable amount of Tahoe-derived code running locally on A (so it
> can produce this information in the exact same way that B will
> eventually do so). In fact it starts to sound more and more like a
> Helper-ish relationship: some Tahoe code on A, some other Tahoe code
> over on B.
Hmm. I didn't realize that segment size was dependent on 'k'. I thought
segments were fixed at 128 KiB? Or is that buckets? Or blocks? I'm still
quite hazy on the precise meaning of bucket and block.
This is a very good point, though. I wouldn't want 'A' to have to understand
Tahoe's segmentation decisions. I'm not sure why it feels acceptable to have
it know Tahoe's encryption and hash tree generation in detail, but not
segmentation. Maybe because segment sizes have changed in the past and it
seems reasonable that they might change again in the future -- perhaps even
get chosen dynamically at some point?
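For what it's worth, the constraint as I understand it (an assumed simplification on my part; the authoritative logic lives in Tahoe's upload code) is just that each segment must split evenly into k FEC blocks, so the effective segment size gets rounded to a multiple of k:

```python
def choose_segment_size(max_segment_size: int, k: int) -> int:
    """Round the configured segment size down to a multiple of k,
    since each segment is split into k blocks for erasure coding.
    (Sketch only; Tahoe's real logic may differ in detail.)"""
    return max(k, (max_segment_size // k) * k)

choose_segment_size(128 * 1024, 3)  # -> 131070, not the full 131072
```

Which is exactly why I wouldn't want process 'A' to re-derive this on its own.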
It's probably better to assume that all of this knowledge is only in Tahoe and
the client has to provide the plaintext in order to get a cap.
> (hey, wouldn't it be cool if local filesystems
> would let you store a bit of metadata about the file which would be
> automatically deleted if the file's contents were changed?)
That *would* be cool.
> Hm, it sounds like some of the use case might be addressed by making it
> easier to run additional code in the tahoe node (i.e. a tahoe plugin),
> which might then let you move "B" over to where "A" is, and then
> generally tell the tahoe node to upload/examine files directly from disk
> instead of over an HTTP control+data channel.
That would be very useful. I have to make copies of files before uploading
them anyway, so that they don't change while uploading (I map the file
content hash to a read cap, so I need to be absolutely sure that the file
uploaded is the same one I hashed), and then Tahoe has to make another copy
before it can encode. Being able to tell Tahoe where to grab the file from
the file system would reduce the number of copies by one.
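Concretely, the workflow I'm stuck with today looks something like this (a sketch with invented helper names, using a private temp copy as the stable snapshot):

```python
import hashlib
import os
import shutil
import tempfile

def snapshot_and_hash(path: str):
    """Copy the file to a private temp location so it cannot change
    mid-upload, then hash that copy. The hash -> readcap mapping is
    only trustworthy if the bytes hashed are exactly the bytes that
    later get uploaded, which is why the snapshot (not the original)
    must be handed to Tahoe."""
    fd, snapshot = tempfile.mkstemp()
    os.close(fd)
    shutil.copyfile(path, snapshot)
    h = hashlib.sha256()
    with open(snapshot, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    # caller uploads `snapshot` and records h.hexdigest() -> readcap
    return snapshot, h.hexdigest()
```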
On the "plugin" point, I'm thinking that I want to implement my backup server
as a Tahoe plugin. I'm not sure it makes sense to implement it as a part of
Tahoe, because Tahoe is a more general-purpose system. From a practical
perspective, though, my backup server is (or will be) a Twisted application,
it should live right next to a Tahoe node, and it should start up whenever
the Tahoe node starts and stop whenever the Tahoe node stops. Seems like a
good case for a plugin.