[tahoe-dev] Questions

Brian Warner warner at lothar.com
Thu May 13 23:29:33 UTC 2010


On 5/13/10 6:30 AM, Jason Wood wrote:

>> This is #467 and #573. In my mind these tickets are super-important:
>
> That would be extremely useful, if there is anything I can do to help
> with this in testing or working out use cases please let me know.

The most useful thing I can think of right now is to add your desired
use case to the #467 ticket. Something like "I have three datacenters
with 5 servers each and want to use 3-of-10 encoding and make sure that
I can still retrieve my files even when two datacenters are offline".

The second most useful thing would be for you to use Tahoe (probably in
a testing environment) for a couple of weeks, get used to the tahoe.cfg
format and the overall architecture, and then make some suggestions on
#467 about how you'd like to be able to configure your desired use case.
For example, "I'd like to have a section in tahoe.cfg where I describe
the properties of each server, and then another section where I describe
the rules of share placement", and some potential syntaxes for each. Or
"I want the server's tahoe.cfg to describe its properties (like
'datacenter=XYZ'), and then the client's tahoe.cfg can describe the
rules", or something like that.

(I think that explicit server lists are fairly easy to implement, but
not very flexible. What should the client do when a new server is added
that its tahoe.cfg doesn't mention? Should it use it anyway? Or if one
of the explicitly-listed servers is unavailable, should it declare
failure? On the other hand, a rule like "prioritize uniform
distribution of the 'server.datacenter' property" might be hard to
express clearly in the tahoe.cfg language, so we might resort to a
Python plugin scheme of some sort.)
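To make the plugin idea a bit more concrete, here is a minimal sketch of what a "uniform distribution by datacenter" rule might look like. Nothing like this exists in Tahoe today: the function names and the `servers` mapping (server name to a hypothetical 'datacenter' property) are invented for illustration. It deals shares round-robin across datacenters, then checks the three-datacenter example from above. With 3-of-10 encoding and five servers per datacenter, the shares land 4/3/3, so any one surviving datacenter still holds at least three.

```python
from itertools import combinations


def place_shares_uniformly(servers, total_shares):
    """Hypothetical placement rule: assign share numbers 0..total_shares-1
    so each datacenter gets as close to an equal number of shares as possible.

    `servers` maps server name -> datacenter label (an assumed property).
    Returns {share_number: server_name}.
    """
    # Group servers by datacenter, in a stable (sorted) order.
    by_dc = {}
    for name, dc in sorted(servers.items()):
        by_dc.setdefault(dc, []).append(name)
    dcs = sorted(by_dc)
    next_index = {dc: 0 for dc in dcs}
    placement = {}
    for share in range(total_shares):
        # Round-robin over datacenters, then over servers inside each one.
        dc = dcs[share % len(dcs)]
        members = by_dc[dc]
        placement[share] = members[next_index[dc] % len(members)]
        next_index[dc] += 1
    return placement


def survives_datacenter_loss(placement, servers, needed, failed_dcs):
    """True if at least `needed` shares remain no matter which
    `failed_dcs` datacenters go offline at the same time."""
    all_dcs = sorted(set(servers.values()))
    for lost in combinations(all_dcs, failed_dcs):
        remaining = sum(1 for srv in placement.values() if servers[srv] not in lost)
        if remaining < needed:
            return False
    return True


# Three datacenters with five servers each, 3-of-10 encoding.
servers = {f"{dc}-{i}": dc for dc in ("dc1", "dc2", "dc3") for i in range(5)}
placement = place_shares_uniformly(servers, 10)
print(survives_datacenter_loss(placement, servers, needed=3, failed_dcs=2))
```

The same check with `needed=4` fails (losing the two datacenters that hold four and three shares leaves only three), which is the kind of constraint a real placement rule would have to reject or warn about.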

> Does this negate the advantage of having the storage nodes use RAID-5/6?
> Would it make sense to just use RAID-0 and let Tahoe-LAFS deal with the
> redundancy?

The Allmydata grid didn't bother with RAID at all: each Tahoe storage
server node used a single spindle.

The "RAID and/or Tahoe" question depends upon how much you trust RAID vs
how much you trust Tahoe, and how expensive the different forms of
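repair would be. Tahoe can correctly be thought of as a form of
"application-level RAID", with more flexibility than the usual
RAID-0/1/4/5 styles: RAID-1 (mirroring) is equivalent to 1-of-2
encoding, RAID-0 (striping) is like 2-of-2 with no redundancy at all (so
losing any one disk loses every share on that node), and a three-disk
RAID-4 or RAID-5 array is like 2-of-3.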
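A small sketch of that analogy: treat each layout as a k-of-n code and report its storage expansion (n/k) and how many pieces it can lose. The RAID rows are only the rough equivalences above (RAID works on disk blocks, not whole files), and 3-of-10 is Tahoe's default encoding.

```python
from fractions import Fraction


def describe_encoding(k, n):
    """Summarize a k-of-n erasure code: any k of the n pieces recover the data."""
    return {
        "expansion": Fraction(n, k),   # bytes stored per byte of user data
        "tolerated_losses": n - k,     # pieces that may be lost without data loss
    }


# Rough RAID analogies, plus Tahoe's default 3-of-10 encoding.
LAYOUTS = {
    "RAID-0 (2 disks, striping)": (2, 2),
    "RAID-1 (2 disks, mirroring)": (1, 2),
    "RAID-5 (3 disks)": (2, 3),
    "Tahoe default (3-of-10)": (3, 10),
}

for name, (k, n) in LAYOUTS.items():
    info = describe_encoding(k, n)
    print(f"{name}: {k}-of-{n}, expansion {float(info['expansion']):.2f}x, "
          f"survives {info['tolerated_losses']} lost piece(s)")
```

The trade-off shows up directly: 3-of-10 costs about 3.33x the raw space but survives seven lost shares, while RAID-0 costs nothing extra and survives none.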

Using RAID to achieve your redundancy gets you fairly fast repair,
because it's all being handled by a controller that sits right on top of
the raw drive. Tahoe's repair is a lot slower, because it is driven by a
client that examines one file at a time, with many network round trips
for each file. Rebuilding a failed drive in a 1TB RAID-5 array can
easily finish within a day; if that 1TB drive is filled with a million
Tahoe files, the repair could take a month. On the other hand,
many RAID configurations degrade significantly when a drive is lost, and
Tahoe's read performance is nearly unaffected. So repair events may be
infrequent enough to just let them happen quietly in the background and
not care much about how long they take.
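The "a month for a million files" figure is easy to sanity-check. A short sketch, assuming repair handles files strictly one after another: a 30-day repair of a million files works out to about 2.6 seconds per file. The 0.5 s/file figure is a made-up comparison point, not a measured Tahoe number.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def repair_days(num_files, seconds_per_file):
    """Days needed to repair `num_files` files one at a time."""
    return num_files * seconds_per_file / SECONDS_PER_DAY


def implied_seconds_per_file(num_files, days):
    """Per-file repair time implied by finishing `num_files` in `days`."""
    return days * SECONDS_PER_DAY / num_files


# A month (30 days) for a million files means roughly 2.6 s per file...
print(round(implied_seconds_per_file(1_000_000, 30), 2))  # 2.59
# ...while a hypothetical 0.5 s per file would take about six days.
print(round(repair_days(1_000_000, 0.5), 1))  # 5.8
```

Because the cost scales with the number of files rather than the number of bytes, a drive full of small files takes much longer to repair than one holding a few large files.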

The optimal choice is a complicated one. Given inputs of:

 * how much data will be stored, how it changes over time (inlet rate,
churn)
 * expected drive failure rate (both single-sector errors and complete failure)
 * server/datacenter layout, inter/intra-colo bandwidth, costs
 * drive/hardware costs

it becomes a tradeoff between money (number of tahoe storage nodes, what
sort of RAID [if any] you use for them, how many disks that means, how
much those disks cost, how many computers you need to host them, how
much bandwidth you spend doing upload/download/repair), read/write
performance, and the probability of file loss due to failures happening
faster than repair.
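For the "probability of file loss" input, a common first approximation is binomial. This sketch assumes each share sits on a different server and that each server is up independently with probability p. That independence assumption is optimistic: correlated failures (a whole datacenter going dark) are exactly why share placement matters.

```python
from math import comb


def file_availability(k, n, p):
    """Probability that at least k of n shares are retrievable, assuming each
    share is on a distinct server that is up independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for p in (0.9, 0.99):
    print(f"p={p}: 3-of-10 -> {file_availability(3, 10, p):.10f}, "
          f"single copy -> {file_availability(1, 1, p):.10f}")
```

Plugging in your own per-server failure estimates for a few (k, n) choices gives a feel for how much redundancy buys, before weighing it against disk and bandwidth cost.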

In addition, Tahoe's current repair code is not particularly clever: it
doesn't put the new shares in exactly the right places, so you can
easily get shares doubled up and not distributed as evenly as if you'd
done a single upload. This is being tracked in ticket #610.
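One way to see why doubled-up shares matter: what counts is how many *servers* can fail before fewer than k shares remain. This hypothetical helper takes the per-server share counts for one file and assumes the worst case, where the servers holding the most shares fail first. The two layouts below are illustrative, not taken from a real grid or from ticket #610.

```python
def tolerated_server_failures(share_counts, needed):
    """Largest number of servers that can fail, in the worst case, while at
    least `needed` shares survive. `share_counts` lists how many shares of
    one file each server holds."""
    remaining = sum(share_counts)
    failures = 0
    # Worst case: the servers holding the most shares fail first.
    for count in sorted(share_counts, reverse=True):
        if remaining - count < needed:
            break
        remaining -= count
        failures += 1
    return failures


# 3-of-10 spread over ten servers vs. the same ten shares doubled up on two servers.
print(tolerated_server_failures([1] * 10, 3))                   # 7
print(tolerated_server_failures([2, 2, 1, 1, 1, 1, 1, 1], 3))   # 5
```

Both layouts hold ten shares, but the doubled-up one survives two fewer server failures.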

> More questions:
>
> Are links stored in the same way that files are? So if a storage node
> containing a link to a file goes down, will that link exist on another
> node?

Directories (which are tables of name/filecap pairs) are stored as
regular files, with the same encoding technique. So they are just as
safe/vulnerable to storage server failures as regular files are. The
usual way people use Tahoe is to remember a single "rootcap" that points
at their top-level directory, and not keep track of anything else. In
this mode, if you have a directory A which contains filecap B, then to
get the file back, you need to be able to read both A and B. So each
additional link you must follow to reach the data slightly reduces the
chance of getting the file back. But we use so much redundancy that even
fairly deep directory structures are practically as reliable as single
files.
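As a rough illustration, if each object (every directory on the path and the file itself) is independently retrievable with probability a, a file reached through d directories is retrievable with probability a^(d+1). Independence is a simplification, and the per-object figure below is a hypothetical value in the range a well-spread 3-of-10 file might reach.

```python
def path_availability(object_availability, depth):
    """Chance of reaching a file stored `depth` directories below the rootcap,
    assuming each directory and the file are independently retrievable with
    the same probability `object_availability`."""
    # All `depth` directories plus the file itself must be readable.
    return object_availability ** (depth + 1)


# Hypothetical per-object availability for a well-spread 3-of-10 file.
per_object = 0.9999996
for depth in (1, 5, 20):
    print(depth, f"{path_availability(per_object, depth):.7f}")
```

When the per-object loss probability is tiny, a path 20 directories deep only multiplies it by about 21, which is why deep trees stay practically as reliable as single files.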

> Can there be more than one introducer node? The documentation seems to
> suggest there can be only one but this would be a single point of
> failure wouldn't it?

Not yet (see ticket #68, which M O Faruque Sarker will be implementing
this summer, again thanks to GSoC). Currently, the Introducer is indeed
a SPOF, but a fairly mild one, because the introducer is only needed
briefly when each node is started and joins the grid. After a node connects
to the Introducer and downloads the current list of storage servers, it
doesn't need to talk to the Introducer again until a new server is
added. So, if the Introducer goes offline for an hour, the only impact
would be on clients or servers which join during that hour: all
previously-connected nodes will keep on talking to each other directly.

> Can there be more than one storage folder on a storage node? So if a
> storage server contains 3 drives without RAID, can it use all 3 for
> storage?

Not directly. Each storage server has a single "base directory" which we
abbreviate as $BASEDIR. The server keeps all of its shares in a
subdirectory named $BASEDIR/storage/shares/ . (Note that you can symlink
this to whatever you want: you can run most of the node from one place,
and store all the shares somewhere else). Since there's only one such
subdirectory, you can only use one filesystem per node.
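The symlink trick mentioned above can be scripted. Here is a minimal sketch, assuming the node is stopped and the destination path does not exist yet. The function name and paths are hypothetical. `shutil.move` falls back to copy-and-delete when the destination is on another filesystem.

```python
import os
import shutil


def relocate_shares(basedir, new_location):
    """Move BASEDIR/storage/shares to `new_location` and leave a symlink
    behind, so the node keeps using its usual path. Stop the node first."""
    shares = os.path.join(basedir, "storage", "shares")
    if os.path.islink(shares):
        raise RuntimeError(f"{shares} is already a symlink")
    if os.path.exists(new_location):
        raise RuntimeError(f"{new_location} already exists")
    shutil.move(shares, new_location)  # copies + deletes across filesystems
    os.symlink(os.path.abspath(new_location), shares)
```

For example, `relocate_shares("/home/tahoe/.tahoe", "/mnt/bigdisk/shares")` (hypothetical paths) would leave the node's base directory in place and put every share on the big disk.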

On the other hand, shares are stored in a set of 1024 subdirectories of
that one, named $BASEDIR/storage/shares/aa/,
$BASEDIR/storage/shares/ab/, etc. If you were to symlink the first third
of these to one filesystem, the next third to a second filesystem, and so
on (hopefully with a script!), then you'd get about a third of the shares
stored on each disk. The "how much space is available" and
space-reservation tools would be confused, but basically everything else
should work normally.
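Here is a sketch of such a script. It assumes the prefix directories are the 1024 two-character names drawn from Tahoe's lowercase base32 alphabet (32 × 32 = 1024); check that against a real `storage/shares/` listing before relying on it. It also assumes a fresh, stopped node with no existing prefix directories. The mountpoints and the `tahoe-shares` subdirectory name are hypothetical.

```python
import os
from itertools import product

# Assumption: prefix directories use two characters from this alphabet.
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def prefix_names():
    """All 1024 two-character prefix directory names, in sorted order."""
    return ["".join(pair) for pair in product(BASE32_ALPHABET, repeat=2)]


def plan_split(mountpoints):
    """Assign each prefix to a mountpoint in contiguous, near-equal blocks
    (the first third to the first disk, the next third to the second, ...)."""
    names = prefix_names()
    return {name: mountpoints[i * len(mountpoints) // len(names)]
            for i, name in enumerate(names)}


def apply_split(basedir, mountpoints):
    """Create each prefix directory on its assigned disk and symlink it into
    BASEDIR/storage/shares/. Intended for a fresh, stopped storage node."""
    shares = os.path.join(basedir, "storage", "shares")
    os.makedirs(shares, exist_ok=True)
    for name, mount in plan_split(mountpoints).items():
        target = os.path.join(mount, "tahoe-shares", name)  # hypothetical layout
        os.makedirs(target, exist_ok=True)
        link = os.path.join(shares, name)
        if os.path.lexists(link):
            raise RuntimeError(f"{link} already exists; move existing shares first")
        os.symlink(os.path.abspath(target), link)
```

With three disks, for example `apply_split(basedir, ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"])`, the split is 342/341/341 prefixes. Since storage indexes are effectively random, that should also put roughly a third of the shares on each disk.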

> And finally, are there any large companies relying on Tahoe-LAFS at
> present? I'm trying to sell this to the powers that be and if I can drop
> some names I would stand a much better chance !

Can't help you there :). Maybe some of the users are hanging out on this
list and can jump in with their experiences.

cheers,
 -Brian


