[tahoe-dev] v1.5 status

Zooko O'Whielacronx zookog at gmail.com
Wed Jul 15 13:27:34 UTC 2009


Well, I guess it's the difference between bugs in the source code
itself that we are about to ship as "v1.5", operational issues
that affect TestGrid specifically, and operational issues that might
typically affect other users.

If it is not a new bug in the source code since v1.4.1 then it
wouldn't be a "regression" (people who are happily using v1.4.1 would
not have a reason to refrain from upgrading to v1.5).  If it is an
operational issue that strikes rarely, doesn't cause great harm, and
is easy to work around then it isn't "critical".  (This part is, of
course, a judgment call.)

The investigation isn't complete, but so far it looks like the
situation on the test grid where you can't create new directories is
due to some combination of:

1.  Limitations in the code that were already present in the v1.4.1
release (therefore not a regression): ticket #540

2.  The limitation is that the testgrid web gateway
(http://testgrid.allmydata.org:3567 ) is not handling misbehavior by
some of the storage servers.  That's a bug, but it probably
won't affect lots of users.  It can be "worked around" by fixing your
storage servers.

3.  The misbehaving storage servers are running TahoeLAFS-v1.3-r3747,
which is older than the current stable v1.4.1 release.  It's possible
(but again, without a complete investigation I don't know if it is
true) that the cause of the MemoryError in the storage server has been
fixed since then.

So, I think the next steps are:

1.  Investigate more.  Are any other storage servers besides
tahoebs5.allmydata.com bs5c2 misbehaving?  Do the munin graphs of
bs5c2 show any interesting pattern in memory usage or other
statistics?

2.  Upgrade bs5c2 and reboot it, probably making TestGrid usable again.

3.  ?  Maybe experiment with adding some sort of kludge to
hard-shutdown in case of MemoryError.
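
For illustration only, the kludge in step 3 might look something like
the sketch below.  This is not Tahoe code: the name
hard_shutdown_on_memory_error and its exit_func and exit_code
parameters are made up for this example, and it assumes that something
outside the process (an init script or a human) restarts the storage
server after it exits.  The only library call it leans on is
os._exit(), which ends the process immediately without running cleanup
handlers.

```python
import os
import sys


def hard_shutdown_on_memory_error(func, exit_func=os._exit, exit_code=1):
    """Wrap func so that a MemoryError ends the process at once.

    Hypothetical helper, not part of Tahoe.  exit_func defaults to
    os._exit, which skips atexit handlers and finally blocks in other
    frames; it is a parameter only so the wrapper can be exercised
    without killing the interpreter.
    """
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except MemoryError:
            # We are already short on memory, so do as little as possible
            # before exiting.
            try:
                sys.stderr.write("MemoryError: hard shutdown\n")
            finally:
                exit_func(exit_code)
    return wrapper
```

The idea is that a server which dies outright is easier for clients
and operators to notice than one that stays up but keeps failing
requests.  Any other exception passes through the wrapper unchanged.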

I'm about to do #2, even though I don't want doing so to interfere
with #1, and then I need to go to work.  :-)

Regards,

Zooko

tickets mentioned in this e-mail:

http://allmydata.org/trac/tahoe/ticket/540 # inappropriate
"uncoordinated write error" after handling a server failure


