[tahoe-dev] Sharing Tahoe file system directories? Loops, lost objects, etc.

Brian Warner warner at lothar.com
Wed Apr 16 02:35:29 UTC 2008


> I hope you and they don't mind my sharing some higher level messages
> regarding the Tahoe file system with my cap-talk colleagues.

Please do! Zooko and I both follow cap-talk (and I attend the Friday
meetings), but I'm not yet a regular contributor there.

> > By encrypting the write-caps of the child objects (whether they be files or
> > subdirectories) the right way, we achieve transitive readonlyness, aka "deep
> > read-only", which (as you've pointed out) is far more useful.
> 
> This is different.  I'll guess and you can correct me.  I guess that
> you store the "write-caps" with a key that is part of the write capability
> to the directory node.  If that's the case then I also guess that you
> store two versions of each object that has a writable version in the
> directories, the writable version encrypted as above and the read-only
> (deep read-only of course for directory nodes) version that is unencrypted.
> 
> Is that guess correct about your "trick"?

Correct. Our dirnodes are a table with four columns in it:

 * child name
 * child read-cap
 * ENCRYPTED(child write-cap)
 * serialized metadata

And we use a hash of the dirnode's write-cap as the encryption key for the
third column. If a child object does not have a write-cap (as is the case for
immutable files), then we use an empty string instead.

The whole table is serialized and then encrypted with a key that is derived
from the read-cap. So you have to hold the read-cap to get anything, but you
also have to hold the write-cap to see the write-caps of the children.
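
In rough Python terms, the layout looks something like this (just a sketch,
not our actual code: the helper names, plain SHA-256 tagging, JSON
serialization, and pycryptodome's AES-CTR are illustrative stand-ins for the
tagged hashes and formats Tahoe really uses):

  import hashlib
  import json
  import os

  from Crypto.Cipher import AES  # pycryptodome


  def derive_key(secret, tag):
      # hypothetical tagged derivation -- Tahoe really uses tagged SHA-256d;
      # the tag keeps the two uses of the write-cap from sharing a key
      return hashlib.sha256(tag + b":" + secret.encode()).digest()


  def aes_ctr(key, plaintext):
      nonce = os.urandom(8)
      return nonce + AES.new(key, AES.MODE_CTR, nonce=nonce).encrypt(plaintext)


  def pack_dirnode(dir_write_cap, dir_read_cap, children):
      # children: iterable of (name, read_cap, write_cap_or_None, metadata),
      # with caps as strings and metadata as a JSON-friendly dict
      writecap_key = derive_key(dir_write_cap, b"child-writecap")
      table = []
      for name, read_cap, write_cap, metadata in children:
          table.append({
              "name": name,
              "ro": read_cap,
              # immutable children have no write-cap: store an empty string
              "rw": aes_ctr(writecap_key, write_cap.encode()).hex()
                    if write_cap else "",
              "meta": metadata,
          })
      # the whole table is only readable with a key derived from the read-cap
      body_key = derive_key(dir_read_cap, b"dirnode-body")
      return aes_ctr(body_key, json.dumps(table).encode())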

> Hmmm.  I believe I can understand your design choice.  However, it doesn't
> seem to me that the choice was either/or.

True. The constraints we were paying attention to were the size of the caps,
and the ability to derive one from another without requiring a trip to the
server. We established two levels of caps: the write-cap, from which you can
derive the read-cap. By choosing to encrypt the child write-caps in the way
we did, we remove them from the reach of the dirnode read-cap.

The actual way we implement this is a bit complicated (see
http://allmydata.org/trac/tahoe/browser/docs/mutable.txt for extreme amounts
of detail), but in essence there is an RSA keypair called the Signing Key
(which provides write access to the mutable file) and the Verifying Key
(which is public knowledge but necessary for integrity checking). There are
also AES symmetric keys derived from the Signing Key by hashing. The first
such derivation is called the "write key", and it protects a copy of the
Signing Key that is kept in each share (we would prefer to put it in the
write-cap, but it is too large.. which is one of the reasons we're moving to
EC-DSA in the next few months). The write key also protects the encrypted
child write-caps. (note that we are, of course, using two different hash
derivatives for these two purposes. Never reuse a key.)

Then the "read key" is the hash of the "write key", and it protects the
contents of the mutable file itself.

If we had defined an intermediate key, such that:

 intermediate key = HASH(write key)
 read key = HASH(intermediate key)

Then we could use the intermediate key to encrypt the child write-caps, and
keep using the write-key to protect the Signing Key, and then we would have
the system you describe: write key gives you full write access, intermediate
key gives you read-only for this dirnode but read-write for all children, and
the read key gives you transitive read-onlyness.
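
Sketched in Python (plain tagged SHA-256 standing in for our actual tagged
SHA-256d derivation, and the key-material placeholder is made up):

  import hashlib

  def h(tag, data):
      # stand-in for Tahoe's tagged SHA-256d derivation
      return hashlib.sha256(tag + b":" + data).digest()

  signing_key_material = b"...bytes derived from the RSA Signing Key..."

  write_key        = h(b"write", signing_key_material)
  intermediate_key = h(b"intermediate", write_key)   # the proposed extra level
  read_key         = h(b"read", intermediate_key)

  # write_key:        full write access; protects the stored Signing Key copy
  # intermediate_key: could protect the child write-caps -- shallow read-only
  #                   on this dirnode, but still read-write on its children
  # read_key:         protects the dirnode contents -- transitive read-only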

> Why would I want a shallow read-only directory capability?  One example
> is to manage a project with other colleagues who I trust with write
> access to some of the underlying objects.  I can manage the project by
> choosing what to put into the shallow read-only directory (including
> whether some of the pieces are writable, shallow read-only, or deep
> read-only capabilities to directories) - nobody who I give it to can
> modify it - but everybody who I give the shallow read-only capability
> to can extract what's in it and write to that which I choose to share
> write access.

That's a good point. I'll make a note in our DSA design
(http://allmydata.org/trac/tahoe/ticket/217) to include such a layer. I don't
know how to present the distinction in a UI cleanly, though.. most people
aren't keenly aware of the difference, so there's a lot of opportunity for
security-breaking surprises if we offer shallow-readonly.

> The other sort of access that I've found useful in the past is
> "append-only".  This access allows one to insert a named capability
> into the directory but not to otherwise modify the contents (e.g. not
> to delete anything from the directory - rename, etc.).
> 
> In thinking about this a bit while considering the Tahoe file system,
> I don't think there is any cryptographic trick that is likely to achieve
> "append-only".

Heh.. I've put a lot of thought into this subject :). I would love to have
append-only files or dirnodes in Tahoe. The closest I've been able to work
out is a structure in which the servers accept signed requests to append a
record to the share. The "append-cap" is just this signing key, and the
servers are told the corresponding verification key. But every once in a
while the holder of a full-strength write-cap needs to come along and merge
all the append records down into the base section (and sign it with something
stronger than the append-cap), because until this happens, the servers will
be able to selectively remove the appends that they don't like. (you get some
protection by using multiple servers, but we'd prefer stronger properties
than that).
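
A rough sketch of that server-side check, with Ed25519 (from the
'cryptography' package) standing in for whatever signature scheme a real
append-cap would use -- this is a thought experiment, not an existing Tahoe
feature:

  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


  class AppendOnlyShare:
      # Server-side view of one share.  'base' is the section signed by the
      # full-strength write-cap; 'appends' are records accepted since the
      # last merge, which a misbehaving server could silently drop.
      def __init__(self, base, append_verify_key):
          self.base = base
          self.appends = []
          self.append_verify_key = append_verify_key

      def accept_append(self, record, signature):
          try:
              self.append_verify_key.verify(signature, record)
          except InvalidSignature:
              return False
          self.appends.append(record)
          return True


  # Client side: the append-cap is just the signing key.
  append_cap = Ed25519PrivateKey.generate()
  share = AppendOnlyShare(base=b"", append_verify_key=append_cap.public_key())
  record = b"add child: name=dropbox-item cap=URI:..."
  assert share.accept_append(record, append_cap.sign(record))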

Append-only (and being able to distinguish an append-cap from a read-cap)
lets you create drop-boxes, which I remember fondly from the early days of
the Mac's LocalTalk-based networked filesystem.

> Since Tahoe file system capabilities are data capabilities that can be sent
> in messages, I don't believe such an "append-only" directory access mode is
> really needed.

True. Having them in the filesystem is nice, but possibly not worth the
effort.

> Hmmm.  I read the above details.  I have to admit that I don't understand
> the requirement for the proposed traverse and verify forms of directory
> access, "Writecaps beget readcaps.  Readcaps beget traversecaps. Traversecaps
> beget verifycaps."  e.g. where you say, "If we had traversal caps, then
> customers could give us the traversal cap instead of their read cap. We
> could still see the shape of their filesystem (and probably the length
> of their filenames, and the size of their files), but perhaps that would
> be little enough exposure that customers would be comfortable with
> revealing it."

This derives from a tension that we're experiencing right now with the
differences between Tahoe (the open-source project that Zooko, Rob
Kinninmont, and myself are developing) and the commercial allmydata.com
service (which Zooko, Rob, and myself are developing). If you're running a
Tahoe friendnet with a couple of technically-savvy buddies, you don't need
some of the same features that a mom-and-pop consumer-oriented service would
need.

There are a number of accounting/file-repair tasks that require somebody to
periodically (maybe once a week) walk through all of your files and do
something with the verify-cap of each of them. Depending upon how your
storage servers are managing their space, you might need to renew a lease on
each file, or you might detect that a server has failed and new shares need
to be regenerated ("repair"). If you are only paying for 1GB of space, then
somebody who cares about enforcing that limit needs to add up the size of all
your files and compare it against the limit. In addition, some users expect
that the company they're paying for backup services will help them recover
their files even if they forget their root directory write-caps.. the
password recovery problem.

We've been very careful to make sure that Tahoe can provide confidentiality
and integrity without relying upon the good behavior of any node but your
own. However, to provide password recovery (as well as quota checking, lease
renewal, and file repair), allmydata.com currently needs to hold on to our
customers' root-caps. I want to fix this, because I don't want us to be able
to see our customers' data any more than they do. If we had a
"traversal-cap", then we could perform three of those four functions with it
instead of the root-cap, which would bring us closer to the day when we don't
need to hold on to the root-cap anymore. And savvy users (who care about
their confidentiality more than password-recovery, and want to manage their
own root-cap) can give us only their traversal-cap and let us manage the
leases and repairs, but keep their files secret from us in the way that Tahoe
was designed to do.


> imagine a day (soon if I can get there from here) when some Tahoe
> directories contain capabilities from different introducers. E.g. if I set
> up my "friendnet" with a separate introducer (as I believe I'm forced to
> do) and then store some capabilities from both the allmydata introducer and
> my friendnet introducer into directories introduced by each. Perhaps you
> have such mixed directories now in testing?
> 
> How would/do traverse and verify capabilities to directories work
> in the above context?  How would you imagine them being used?

Oh, that's a fun and big bag of worms :). We've been arguing about this one
since the beginning. The short answer is that we don't know yet. In the
present design, each cap is only meaningful on a single grid, so you have to
keep them segregated.

To get short caps, you need to have some sort of context. The Tahoe
storage/peer-selection algorithm is based around the notion of a relatively
stable set of storage servers: there needs to be a certain amount of overlap
between the servers that were present when you uploaded the file, and the
ones that are present when you choose to download it. So the context is a
roughly-accurate set of servers.

If there is only the One Big Grid, then this is a question of managing a large
and dynamic list of peers, and dealing with varying reliabilities,
availabilities, capacities, and nodes coming and going without warning. If
you have a bunch of separate grids, then any particular grid might be more
stable and reliable (at least it won't be changing quite so quickly), but now
you need some way to tell one grid apart from another.

We've kicked around the idea of a "grid id", and of somehow making some room in
the file caps to hold it. Since caps are durable, the grid id can't change
over time. But for caps to be reliable and available, the grid id must remain
useful even as servers come and go.

Perhaps it can be a question of scope, like phone numbers and area codes: if
the cap doesn't have a grid id, treat it as belonging to the same grid that
you do. Or maybe the grid should be defined by the introducer (except for the
obvious availability problems that represents, and for the fact that we want
to move away from a single introducer rather than forcing ourselves to rely
upon it even more), and the introducer's FURL should be stuffed into each
cap.

One problem is the difference between identification and routing information.
A fairly short random number (say, 80 bits, give or take birthdays) is enough
to uniquely identify uncoordinated objects, and if you can coordinate their
allocation (or if there aren't very many objects) you can get away with even
fewer. It might be enough to add this number into each cap. On the other
hand, it takes more bits to give somebody enough information to find the
object on the internet (versus helping them decide that the object they think
they've found is the right one). It's the difference between the key
fingerprint and the connection hints in a web-key or FURL.

The challenge is even greater if you want this grid-id to be robust against
server failures. Obviously a single IP address isn't very robust. How about
the contact info for everybody who is currently in the target grid? Certainly
with a distributed introduction mechanism (which is where we want to go, see
http://allmydata.org/trac/tahoe/ticket/68), then any member of the group
should be able to introduce you to everyone else. But that's a lot of data to
squeeze into a read-cap that's supposed to be sharable over IM. Maybe
somewhere in between these two extremes. Or maybe we declare a DNS scheme in
which each grid registers itself with some central (but replicated)
facility, to help users translate their too-small-to-provide-routability
grid-ids to actual contact info.

Zooko feels that making the One Big Grid work is feasible, especially if we
use modern DHT scaling techniques like Chord. I'm less confident of that, so
being able to make the allmydata service use multiple separate grids feels
like a good fallback position (if it grows too big to manage, split it into
multiple separate grids).

Fun stuff :-).


> I believe such accounting issues can and should be kept as 'back end'
> mechanisms. It doesn't seem to me wise to make them visible through the
> directory interface (e.g. traverse and verify directory capabilities). I
> believe Allmydata should be able to find and account for all objects it is
> servicing for a given account *without* having to traverse any directory
> structure to find them.

Yes, the traversal caps are directly related to accounting and repair, and in
the long run I hope that they will not be necessary. But there are short- and
medium- term situations which will be easier to deal with by having them.
Consider the issue of garbage collection. My current plan is to have each
client build the "verifier manifest" starting from their root directory, and
write the list of files and directories that it finds in a local file. Then
tomorrow, the client does this again, and computes the differences. For each
file that they have deleted (or which was pruned because a directory was
unlinked), go to the storage servers and cancel the lease, allowing the
server to free up the space (if nobody else retains a lease on that share).
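
Sketched in Python, with hypothetical build_manifest/cancel_lease callables
standing in for the real manifest and lease operations:

  import json


  def gc_pass(build_manifest, cancel_lease, manifest_path="verifier-manifest.json"):
      # build_manifest() -> set of verify-caps reachable from the root dircap;
      # cancel_lease(verify_cap) asks the storage servers to drop our lease.
      # Both are hypothetical callables, not real Tahoe APIs.
      today = set(build_manifest())
      try:
          with open(manifest_path) as f:
              yesterday = set(json.load(f))
      except FileNotFoundError:
          yesterday = set()
      # anything reachable yesterday but not today was deleted or pruned, so
      # let the servers reclaim it (if nobody else still holds a lease)
      for verify_cap in yesterday - today:
          cancel_lease(verify_cap)
      with open(manifest_path, "w") as f:
          json.dump(sorted(today), f)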

We haven't built this mechanism yet.. we hope to get to that point in the
next few months. Until we get there, we can imagine a number of operations
that require walking through the user's whole tree and doing something with
each node. But everything I can think of needing to do between now and then
could be done just as easily with the traversal cap, and I'd feel a lot more
comfortable working with those than with the root-caps. Even after that
point, there are services we might provide to customers (like file-repair)
that depend upon having an accurate manifest of which files the user cares
about, and even savvy users who maintain their own root-caps might want to
give us a traversal cap if it means they can leave their computers turned off
for months at a time.

> A related topic is that of directory loops, lost objects, etc.
> Perhaps without bias I can ask you about that topic to see
> where your considerations have taken you?  For example, while
> "lost" objects can't be found in any directory tree (e.g. no
> extant directory structure, perhaps the service for a needed
> directory is currently off-line) they are still stored.  How are
> they "accounted" for?

It depends upon how you're using the system, and we haven't really come to a
decision about many aspects. The primitive operation will be for the client
to add or cancel a lease on any given share. Each share can have multiple
leases if two or more people have that file in their filesystem. Holding a
lease means that you want to keep that file alive, in exchange for using up
some of your quota (or paying more, or whatever). The goal is to have each
user pay for whatever shares they have leases on, added up across all
servers. (please see
http://allmydata.org/trac/tahoe/browser/docs/accounts-pubkey.txt for some of
our design documents in this area).

Servers might not delete shares instantaneously.. it might be a lot more
efficient to do it as a batch job once a day, or once a month. (there's a
tradeoff between disk IO, CPU, simplicity, and how much extra storage
capacity you need to hold all of the shares that haven't been deleted yet).
So clients might choose to not add/renew a lease and just hope the file
sticks around anyways, or they might not add a lease to a popular file that
has a hundred leases on it already, hoping that not everybody else does the
same thing. A pay-per-byte accounting scheme might pro-rate your charges for
shared files, or charge you full price even if the file is popular (because
that would be far more predictable for both parties).

For the allmydata case we'll probably treat the customer's root-caps as the
roots of a reference graph, and prune everything that isn't attached (in
other words, orphaned objects are eventually garbage collected). Given the
virtual-drive-centric model that we're presenting to backup customers, this
seems like the clearest approach. This wouldn't make it easy to hold on to a
file that isn't in your vdrive (i.e. you upload a file, get its write-cap,
and write it down on a piece of paper).

Loops are easy to manage: as far as accounting goes, that's just a recursive
walk in a directed graph that might have cycles in it. We keep a set of all
nodes that we've begun to visit, and never visit a node that's already in the
set. The current manifest-generation code does exactly that. The accounting
part, when we build it, will work with just this set of objects, without
giving consideration to how many times a given object appeared inside the
vdrive.
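
In Python terms, the walk is just the following (list_children here is a
hypothetical helper, not a real Tahoe API):

  def build_manifest(root_cap, list_children):
      # list_children(dircap) -> {name: (child_cap, is_dir)} is assumed; the
      # point is the 'seen' set, which makes loops harmless and counts each
      # object exactly once no matter how many times it is linked
      seen = set()
      stack = [root_cap]
      while stack:
          cap = stack.pop()
          if cap in seen:
              continue
          seen.add(cap)
          for _name, (child_cap, is_dir) in list_children(cap).items():
              if is_dir:
                  stack.append(child_cap)
              else:
                  seen.add(child_cap)
      return seen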

> I should mention that if you'd prefer that I (we?) butt out of your design
> considerations and limit interactions to testing what you've implemented,
> I'll be happy to do so without rancor.

Thanks for the offer, but I think I'll decline :-). Please keep the
questions, suggestions, and challenges coming. We're developing Tahoe as an
open-source project because we want to make it the best that we can, and it
has already benefitted enormously from the community's experience. Even if we
(wisely or not) think that Tahoe shouldn't go in a particular direction, we
will be better off for having considered it.

cheers,
 -Brian
