[tahoe-dev] split brain/partition tolerance? how handled in tahoe -- docs?

Two Spirit twospirit6905 at gmail.com
Wed Aug 8 17:56:25 UTC 2012


I didn't mean to hit any jugulars. I see network outages all the time at
the edges, and not many sites have the billion dollars required to
maintain 99.999% uptime (four minutes of downtime a year). Some years I'm
lucky, and some years I'm not and easily see a few days to two weeks of
total disconnect time per site per year; a couple of years I've had even
longer. Backhoes hitting fiber-optic lines and DSL lines going down is
what keeps dinner on the table. For me, I don't consider it an unlikely
scenario; it just happens too many times. It's kinda like saying you
don't need backups because failures rarely happen. I have this funny
conversation way too many times; just like babies, a new one is born
every minute.
     Me> you have backups?
     Boss> and why do we need backups?...
     Me> well it is kinda like insurance, for WHEN the hard drives fail...
     Boss> and this happens all the time?...
     Me> no, rarely, but it DOES eventually happen. MTBF blah blah blah...
     Boss> and how much is this going to cost?...
     Me> XYZ blah blah blah dinner on my plate blah blah blah equipment
blah blah blah...
     Boss> blah blah blah...ROI...blah blah blah...we can do without
backups.

To respond to an earlier post on this thread, I see a few scenarios, but
another scenario I'm thinking of is the source control repository (I
don't even want to ask about database support yet). If the source control
repository were inside of Tahoe and a merge left it inconsistent, I can't
even begin to figure out how much work would be involved in recovering
100% from that. (I'm sure the severity of this specific problem would
depend on the specific source control tool.) I think this meets the
criteria that were mentioned: the repository typically runs on a server,
and patches are stored in the repo as a single user, but there are
constant checkins throughout the day, and with two sites, or even a
home-to-corporate scenario, both sides have to keep working. If the repo
comes to a grinding halt, that is good, cuz nobody checks in until the
repo comes back online. But if people are checking their changes in, not
knowing the changes might get lost, and not knowing they ARE getting
lost, that is very bad. Since most users discard their workspace, the
bugs magically reappear later. QA guys really hate that. Without a
preserved copy of the patches somewhere in serverspace, recovery would, I
think, be too time-consuming. I think the management decision would be to
recover from backups/snapshots back to a known working state and go from
there. That would definitely lose some brownie points for Tahoe.

I like the term Data Integrity because I'm concerned about three things:
availability, integrity, and confidentiality. I believe I need all three.
It looks like Tahoe has accounted for the C (with the least-authority
design) and the A (with the distributed shares and shares.happy); my
concern is the I.

On Wed, Aug 8, 2012 at 1:53 AM, Zooko Wilcox-O'Hearn <zooko at zooko.com> wrote:

> There's a tiny chance that a very unlucky sequence of failures or
> network partitions, combined with the uncoordinated use of the same
> write cap by multiple people, will result in the irretrievable
> destruction of your Incoming directory. (To see why, consider that you
> need K different shares of that directory to reconstruct it, and each
> writer is simultaneously writing out shares of their own new version.
> In a very unlucky scenario, each writer would succeed at writing fewer
> than K shares of their own version to the servers, and then suddenly
> disconnect from the Net. The result would be fewer than K shares of
> each of several different versions, meaning that no version is
> recoverable and the directory is lost forever.)
>
> On the other hand, should that unlucky chance not strike, I suspect
> that the "automatic merging of directory modifications" feature -- the
> one I just mentioned that I don't like and want to remove -- is making
> sure that simultaneous uncoordinated adds and removes of children from
> that Incoming directory are reliable.
>
> (I still want to remove it, but now that I see people are relying on
> it, I feel an obligation to replace it with something better when
> doing so!)
>
> If you want to be safer, give each uploader their own separate
> "Incoming-John" directory, and have the curators use a tool to view
> all of the separate Incomings. That would eliminate the risk outlined
> above. (A tool such as "find" if LAFS is mounted via FUSE, a custom
> script that runs "tahoe ls" on each Incoming directory, or a custom
> web app that queries the WAPI.)
>
> Regards,
>
> Zooko
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at tahoe-lafs.org
> https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
>
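The unlucky scenario Zooko describes can be sketched numerically. This is
an illustration, not Tahoe code: the writer names and share counts are
made up, and K=3 simply mirrors Tahoe's default shares.needed setting.

```python
# Sketch of the split-brain failure mode: K-of-N erasure coding, several
# uncoordinated holders of the same write cap, each of whom managed to
# place fewer than K shares of their own new version before disconnecting.
K = 3  # shares needed to reconstruct any single version of the directory

# Hypothetical outcome of the partition: version -> shares that landed.
shares_on_servers = {
    "alice's new version": 2,
    "bob's new version": 2,
    "carol's new version": 2,
}

recoverable = [v for v, n in shares_on_servers.items() if n >= K]
print(recoverable or "no version recoverable: directory lost")
# -> no version recoverable: directory lost
```

Six shares exist in total, yet because no single version reaches K, they
cannot be combined into anything readable. Per-uploader directories, as
Zooko suggests, avoid contention on one write cap entirely.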

