[tahoe-dev] How to fix directory with multiple recoverable versions?

Wed Mar 24 16:27:11 UTC 2010

Dear Humberto Ortiz-Zuazaga:

I'm sorry that nobody replied to this when you posted it back on March 4:

On Thu, Mar 4, 2010 at 6:09 AM, Humberto Ortiz-Zuazaga
<humberto at hpcf.upr.edu> wrote:
>
>  Not Healthy! : Unhealthy: multiple versions are recoverable
>
>     * Report:
>
>       Recoverable Versions: 6*seq24-wgxp/10*seq26-4tdb
>       Unhealthy: there are multiple recoverable versions
>       Best Recoverable Version: seq26-4tdb
>
> I can't see how to fix this. A deep check with the repair checkbox
> leaves the directories in the same state.

Did you find a solution on your own yet?

I'm not sure what we intend the repair tool to do in a case like this.
Apparently we intend for it to warn you and stop, thus failing safe.

Have you tried manually inspecting the directory (if you do, I think
you will see the seq26 version of it, which I assume is newer than the
seq24 version), deciding if it looks okay, and then saving it again?
I guess if you do this, your gateway might detect the older versions
when it is uploading your new version and go ahead and overwrite the
older shares. I'm not sure if it does that though. Let's see...

Looking at tickets marked "mutable"...

Hm. There are a lot of tickets that show that mutable upload/download
isn't as robust as we would like in the face of unusual situations.
(Yours is an unusual situation: somehow the shares of an older version
-- seq24 -- weren't overwritten when a newer version was uploaded,
possibly because those six shares were on servers that weren't
reachable when you uploaded the newer version.)
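
For what it's worth, here is a minimal Python sketch of how I read the
"Recoverable Versions" line in your report. It assumes each entry has
the form <share count>*seq<sequence number>-<root hash prefix> and that
entries are separated by "/". That reading is my assumption from the
report above, and the names (VersionSummary,
parse_recoverable_versions) are made up for illustration; they are not
part of Tahoe-LAFS.

```python
import re
from typing import List, NamedTuple


class VersionSummary(NamedTuple):
    share_count: int       # how many shares of this version were found
    seqnum: int            # sequence number of this version
    roothash_prefix: str   # short prefix that tells versions apart


# Assumed entry format: "<count>*seq<seqnum>-<prefix>", e.g. "6*seq24-wgxp"
_ENTRY = re.compile(r"^(\d+)\*seq(\d+)-([a-z0-9]+)$")


def parse_recoverable_versions(field: str) -> List[VersionSummary]:
    """Parse a string such as '6*seq24-wgxp/10*seq26-4tdb'."""
    versions = []
    for entry in field.strip().split("/"):
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized entry: {entry!r}")
        count, seqnum, prefix = match.groups()
        versions.append(VersionSummary(int(count), int(seqnum), prefix))
    return versions


versions = parse_recoverable_versions("6*seq24-wgxp/10*seq26-4tdb")
newest = max(versions, key=lambda v: v.seqnum)
stale_shares = sum(v.share_count for v in versions if v.seqnum < newest.seqnum)
```

With your report, newest is the seq26 version and stale_shares counts
the six leftover seq24 shares.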

Here are all the tickets that look vaguely related to the topic of
"robust upload/download of mutables":

http://allmydata.org/trac/tahoe-lafs/ticket/232 # Peer selection
doesn't rebalance shares on overwrite of mutable file.
http://allmydata.org/trac/tahoe-lafs/ticket/474 # uncaught exception in
mutable-retrieve: UCW between mapupdate and retrieve
http://allmydata.org/trac/tahoe-lafs/ticket/540 # inappropriate
"uncoordinated write error" after handling a server failure
http://allmydata.org/trac/tahoe-lafs/ticket/541 # foolscap
'reference'-token bug workaround in mutable publish
http://allmydata.org/trac/tahoe-lafs/ticket/546 # mutable-file surprise
shares raise inappropriate UCWE
http://allmydata.org/trac/tahoe-lafs/ticket/547 # mapupdate(MODE_WRITE)
triggers on a false boundary
http://allmydata.org/trac/tahoe-lafs/ticket/548 # mutable publish sends
queries to servers that have already been asked
http://allmydata.org/trac/tahoe-lafs/ticket/549 # MODE_WRITE mapupdate:
maybe increase epsilon to handle large batches of new servers better
http://allmydata.org/trac/tahoe-lafs/ticket/846 #
allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on
a slow machine
http://allmydata.org/trac/tahoe-lafs/ticket/893 # UCWE when mapupdate
gives up too early, then server errors require replacement servers

Oh man, we really need to focus on this stuff. Having all of these
undesirable behaviors (even, or especially, if they crop up only in
rare cases) saps some of the "aura of quality" feeling that I have
about Tahoe-LAFS.

However, everyone already has tasks underway for Tahoe-LAFS v1.7.0
(due in May), so I'm not sure when we're going to get a volunteer to
fix this stuff.

Anyway, we definitely need to open a ticket for Humberto
Ortiz-Zuazaga's problem, which might end up being cross-linked with
some of these other tickets. I suggest the ticket title "how to fix
'multiple versions are recoverable'?". Humberto: would you please open
that ticket?

It isn't clear to me what the repair tool, or a manual
inspect-and-resave, *should* do in this case. Just taking the version
with the highest sequence number and overwriting all the shares of the
older version might not be the right thing to do, if the other version
was caused by someone else simultaneously writing to that mutable
thing rather than by some of the servers being unavailable the last
time the (single) writer wrote to that thing.

On the other hand, simultaneous writers are not a supported use case,
so maybe it is perfectly fine for the recovery process to blindly blow
away anything with an older sequence number than the latest.
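
To make the two options concrete, here is a rough Python sketch of the
decision, not of anything the repairer actually does today. The
function plan_repair, its assume_single_writer flag, and the
(seqnum, prefix) tuples are all hypothetical names I'm using for
illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical version identifier: (sequence number, root hash prefix)
Version = Tuple[int, str]


def plan_repair(
    shares_by_version: Dict[Version, int],
    assume_single_writer: bool,
) -> Tuple[Version, List[Version]]:
    """Return (version to keep, older versions whose shares to overwrite).

    With assume_single_writer=False the plan fails safe: it names the
    best version but overwrites nothing, because an older version might
    be someone else's simultaneous write rather than leftover shares.
    """
    if not shares_by_version:
        raise ValueError("no recoverable versions")
    top_seqnum = max(seqnum for seqnum, _ in shares_by_version)
    candidates = [v for v in shares_by_version if v[0] == top_seqnum]
    if len(candidates) > 1:
        # Two different versions with the same sequence number look
        # like a simultaneous write; don't guess which one to keep.
        raise ValueError("multiple versions share the highest seqnum")
    newest = candidates[0]
    older = sorted(v for v in shares_by_version if v != newest)
    if not assume_single_writer:
        return newest, []
    return newest, older


found = {(24, "wgxp"): 6, (26, "4tdb"): 10}
keep, overwrite = plan_repair(found, assume_single_writer=True)
keep_safe, overwrite_safe = plan_repair(found, assume_single_writer=False)
```

With assume_single_writer=True the plan is to keep seq26 and overwrite
the seq24 shares; with False it keeps seq26 and touches nothing, which
is roughly the "warn and stop" behavior Humberto saw.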

Maybe we can discuss that on the ticket.

Thank you.

Regards,

Zooko