[tahoe-dev] patch review

Brian Warner warner-tahoe at allmydata.com
Sat Jan 3 01:38:55 UTC 2009


(I'm reviewing patches that landed while I was out on vacation).

[3292]: http://allmydata.org/trac/tahoe/changeset/3292
"refactor some common logging behavior into mixins":

 the use of self._parentmsgid is suspicious. It looks like two successive
 calls to log(), like:

   self.log("foo")
   self.log("bar")

 will wind up with "bar" being the child of "foo", whereas they ought to both
 be siblings of the same original parent. In addition to not correctly
 describing the relationship between these events, when these
 long-chain-of-descendants messages are viewed in 'flogtool web-viewer',
 they'll scroll off the right edge of the screen.
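
 For contrast, here's roughly the shape I'd expect (just a sketch, assuming
 the mixin wraps allmydata.util.log.msg and that msg() returns the new
 event's number so it can be passed back in as parent=):

   from allmydata.util import log

   class LogMixin:
       # hypothetical sketch: every message from this object shares one
       # fixed parent, instead of each message becoming the parent of the next
       def __init__(self, parent_msgid=None):
           self._parentmsgid = parent_msgid
       def log(self, msg, **kwargs):
           # "foo" and "bar" both get the same parent, so the web-viewer
           # shows them as siblings rather than a deepening chain
           return log.msg(msg, parent=self._parentmsgid, **kwargs)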


[3305]: http://allmydata.org/trac/tahoe/changeset/3305
"immutable download: don't catch all exceptions, just DeadReferenceError
 and IntegrityCheckReject"

 I haven't looked closely enough to be sure, but does this mean that a single
 server could cause the whole download to fail, just by throwing some
 unexpected exception? That doesn't seem like a good ability to grant to
 storage servers.

 In general, when we're asking multiple servers for data, and any subset
 could do, I've tried to deal with potential exceptions in three classes:

  DeadReferenceError: log it as UNUSUAL, give up on that server. This
                      happens, there's no need to get flustered about it, but
                      it shouldn't happen every day. So I use:

                       log.msg("query failure", failure=f, level=log.UNUSUAL)

  Integrity failures: log it as WEIRD, give up on that server (or on that
                      block but be willing to use others). This indicates a
                      hardware problem, or a significant bug, or malice, and
                      somebody should look into it, but it doesn't prevent us
                      from downloading the file. I use:

                       log.msg("query failure", failure=f, level=log.WEIRD)

  Other exceptions: log it as BAD, give up on that server (or on that
                    block), make sure it gets reported to any unit test
                    that's currently running, then use other servers to
                    download the file. This indicates a significant bug or
                    malice, and somebody should look into it, but it doesn't
                    prevent us from downloading the file. We don't expect
                    this to happen, and the most likely way for this to occur
                    is when we're shaking the bugs out of new code, so we
                    need to make sure the exceptions are easy to spot while
                    we're running unit tests (i.e. we really want the unit
                    test to fail, and for Trial to print out this exception).
                    The allmydata.util.log.err function will report the
                    Failure to both foolscap logging and to
                    twisted.python.log.err(), and the latter will get noticed
                    by Trial (and will flunk the unit test). So I'd use:

                     log.err("query failure", failure=f, level=log.BAD)

[3306]: http://allmydata.org/trac/tahoe/changeset/3306
"util.base32: loosen the precondition forbidding unicode"

 As mentioned in the "String encoding in Tahoe" thread, I prefer arguments to
 be of one class only, and functions which accept unicode-or-bytestring make
 me nervous. Since you reverted the sys.argv conversion in [3311], I'd like
 to see [3306], [3307], and possibly [3308] reverted too.


[3323]: http://allmydata.org/trac/tahoe/changeset/3323
"generically wrap any errback from callRemote() in a ServerFailure instance"

 I'm worried that this is going to change the behavior of error-handling code
 in surprising ways. The core of the change is to add an errback to all
 callRemotes with:

  def _wrap_server_failure(f): raise ServerFailure(f)

 that stashes the original Failure in an attribute of the ServerFailure
 exception, and delegates __repr__ and __str__ through to the original.

 This is probably going to lose the stack frames which are normally stored in
 a Failure, as well as the exception type (so f.trap or f.check will be
 unable to examine the original exception type). This will deny the caller
 the ability to make decisions based upon the remote exception type.

 You might want the errback to just set an attribute on the original Failure
 instance, rather than wrapping it in a distinct exception class. Remember
 that you could get a KeyError from either side, and the calling code might
 want to know about it.
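
 Concretely, something like this (a sketch; the attribute name is made up):

   def _tag_remote_failure(f):
       # mark the Failure as having come from the remote side, but leave its
       # type, value, and traceback alone so f.check()/f.trap() still see the
       # original exception
       f.came_from_remote_server = True  # hypothetical marker attribute
       return f  # returning the Failure keeps the Deferred in the error state
   d.addErrback(_tag_remote_failure)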

 Another option would be to create a marker class named "ServerException" or
 "RemoteException", then create a "RemoteServerFailure" subclass of Failure
 with overridden check()/trap() methods that say "yes" when asked about
 ServerException/RemoteException, and have the wrapper replace the original
 Failure with the new RemoteServerFailure. If RemoteServerFailure could
 delegate everything else to the original, this would give you the desired
 distinguish-remote-failures-from-local-ones property in a fully compatible
 way. It might also be possible to just monkeypatch the original Failure to
 change its check()/trap() methods to behave this way, without creating a
 whole new Failure subclass.
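
 A rough sketch of that shape (names made up; a real version would also
 delegate __repr__/__str__ and the remaining Failure attributes to the
 original):

   from twisted.python.failure import Failure

   class RemoteException(Exception):
       """Marker type: the exception happened on the remote server."""

   class RemoteServerFailure(Failure):
       def __init__(self, original):
           Failure.__init__(self, RemoteException())
           self.original = original
       def check(self, *errorTypes):
           # say "yes" for the marker, otherwise defer to the original
           if RemoteException in errorTypes:
               return RemoteException
           return self.original.check(*errorTypes)
       def trap(self, *errorTypes):
           if RemoteException in errorTypes:
               return RemoteException
           return self.original.trap(*errorTypes)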

 Another approach (and the one that I think I usually take) is to use the
 placement of addErrback()s to identify remote failures, rather than trying
 to use the exception type. The following examples might demonstrate the
 distinction:

  1:
    d = doSomething()
    def do_remote(res):
      return rref.callRemote("foo", args)
    d.addCallback(do_remote)
    def process(results):
      return 2/results
    d.addCallback(process)
    def interpret_error(f):
      if f.check(DeadReferenceError): print "remote"
      if f.check(ZeroDivisionError): print "local"
    d.addErrback(interpret_error)
    return d

  2:
    d = doSomething()
    def do_remote(res):
      d1 = rref.callRemote("foo", args)
      def remote_error(f):
        print "remote"
        return recover_somehow()
      d1.addErrback(remote_error)
      return d1
    d.addCallback(do_remote)
    def process(results):
      return 2/results
    d.addCallback(process)
    def interpret_error(f):
      if f.check(ZeroDivisionError): print "local"
    d.addErrback(interpret_error)
    return d

 I tend to use the latter: if remote_error() fires, the problem occurred
 during the callRemote. This approach is not as convenient to use if you
 really want to bypass the post-callRemote processing in the case of error.
 It seems to work fairly well if the pattern is more "get data from server 1,
 if that fails try to get it from server 2, if that fails try server 3, etc,
 then process whichever data was successfully retrieved".

 This affects [3334] too.


[3326]: http://allmydata.org/trac/tahoe/changeset/3326
"storage: accept any size shares"

 As I mentioned on the ticket (#346), this needs a change to
 StorageServer.VERSION (to increase the advertised share-size limit) before
 clients will actually be willing to take advantage of larger shares. I'm not
 sure what value to suggest we advertise: we could use os.statvfs() to
 estimate how much space is available (and subtract our reserved value) at
 startup and advertise that, or we could find a way to ask the system about
 the largest filesize we can store, or we could pick some arbitrary large
 number (like 2**64).
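
 For the statvfs approach, the startup estimate might look something like
 this (a sketch; "storedir" and "reserved_space" stand in for whatever the
 storage server actually uses):

   import os
   s = os.statvfs(storedir)         # filesystem stats for the share directory
   free = s.f_bavail * s.f_frsize   # bytes available to a non-root user
   advertised_limit = max(0, free - reserved_space)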

 Also, the line that does "max_size % 4294967295" is probably a bug, and
 should be replaced with "max_size % (2**32)".



cheers,
 -Brian


