[tahoe-dev] [tahoe-lafs] #737: python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4"

Sun Jun 21 20:55:55 UTC 2009

#737: python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4"
---------------------------+------------------------------------------------
 Reporter:  midnightmagic  |           Owner:  warner
     Type:  defect         |          Status:  new   
 Priority:  major          |       Milestone:  1.5.0 
Component:  code           |         Version:  1.4.1 
 Keywords:                 |   Launchpad_bug:        
---------------------------+------------------------------------------------

Comment(by warner):

 Wow, fun! A quick look at the python-2.6 source
 (Modules/timemodule.c:floattime) doesn't suggest any obvious way to get a
 NaN: it calls the C gettimeofday/ftime/time (depending upon what your
 platform has), adds the pieces together, and returns the result.
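
 As a rough illustration of why no NaN is expected here, the sketch below
 mimics that arithmetic in Python. It assumes the gettimeofday() path
 amounts to tv_sec + tv_usec * 0.000001; the helper name combine_timeval
 and the sample inputs are made up for this example and are not taken from
 timemodule.c.

```python
import math


def combine_timeval(tv_sec, tv_usec):
    # Assumed shape of the gettimeofday() path: whole seconds plus
    # microseconds scaled down to seconds.
    return float(tv_sec) + tv_usec * 0.000001


# Hypothetical inputs, including deliberately insane ones that a broken
# kernel or clobbered struct might hand back.
samples = [
    (0, 0),
    (1245617755, 999999),
    (-1, -1),
    (2**31 - 1, 2**31 - 1),
    (-2**31, -2**31),
]

results = [combine_timeval(sec, usec) for sec, usec in samples]
any_nan = any(math.isnan(r) for r in results)
print(results, any_nan)
```

 Even the garbage inputs give finite (if wrong) results, which is why a NaN
 points at either the pieces not being integers at all or the floating-point
 operations themselves.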

 You said that a simple test case that just calls time.time() repeatedly
 never failed, right? That's unfortunate. If we didn't think Tahoe was
 involved, I'd suggest instrumenting timemodule.c to remember the pieces
 it got and build the PyFloat; then, if the result is NaN, immediately
 print out the pieces, so we could figure out what gettimeofday() returned
 that provoked a NaN.

 If there were a low-level threading bug that was clobbering memory, I'd
 expect to see exceptions or deeper errors than just a NaN. If time()
 couldn't allocate the memory for the PyFloat object, it would raise an
 exception instead of returning NaN.

 Hm, it's worth noting that floats are formatted to strings (in
 Objects/floatobject.c:format_float) by doing snprintf(), so if your
 platform's libc does something funky with snprintf(), that might cause
 problems. Also, Python doesn't appear to do anything to define or test NaN
 directly: it just tells C to do a+b or a>=b or whatever. So something
 weird in your C compiler's implementation of floating-point math (or your
 CPU) could get involved too.
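
 A small sketch of what "just arithmetic and comparisons" means for NaN,
 assuming IEEE 754 behaviour from the underlying C doubles. The variable
 names are only for illustration.

```python
# NaN falls out of ordinary arithmetic; no special API is needed.
inf = float("inf")
nan = inf - inf

# The usual portable test: NaN is the only value not equal to itself.
is_nan = nan != nan

# Ordered comparisons against NaN are all false, so code that only does
# "a >= b" checks will not notice it.
comparisons = [nan >= 0.0, nan < 0.0, nan == nan]

# Once present, NaN propagates through further arithmetic.
poisoned = nan + 1.0

print(is_nan, comparisons, poisoned != poisoned)
```

 If a platform's compiler or FPU gets any of these wrong, Python inherits
 the mistake, since it delegates each operation to C.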

 If you get into this, you might try:
  * modify Python's timemodule.c to store the values retrieved from
    gettimeofday() in a file-global variable, just before it adds them
    together to create the return value for floattime()/time()
  * add a function to timemodule.c which retrieves these stored values with
    as little interpretation as possible (maybe memcpy them into a string
    in addition to interpreting them as floats)
  * in your catch-NaN-in-reactor.callLater assertion, retrieve and print
    these values
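
 The three steps above can be prototyped in Python before touching the C
 source. This is only a sketch: float_time, last_pieces_raw and
 checked_delay are hypothetical names, the module-global stands in for the
 file-global in timemodule.c, and the pieces come from a caller-supplied
 function rather than from a real gettimeofday() call.

```python
import binascii
import struct

# Stand-in for the file-global variable: the most recent raw pieces.
_last_pieces = None


def float_time(get_pieces):
    # Step 1: remember the pieces just before adding them together.
    global _last_pieces
    sec, usec = get_pieces()
    _last_pieces = (sec, usec)
    return sec + usec * 0.000001


def last_pieces_raw():
    # Step 2: hand the pieces back with minimal interpretation, both as
    # integers and as a hex dump of their packed bytes.
    sec, usec = _last_pieces
    raw = struct.pack("=qq", sec, usec)
    return sec, usec, binascii.hexlify(raw).decode("ascii")


def checked_delay(value):
    # Step 3: the catch-NaN assertion reports the stored pieces.
    if value != value:
        raise AssertionError("NaN time; pieces=%r" % (last_pieces_raw(),))
    return value


now = float_time(lambda: (10, 250000))
print(checked_delay(now), last_pieces_raw())
```

 The hex dump is the useful part: if the integers look sane but the bytes
 do not match them, the corruption happened after gettimeofday() returned.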

 If we catch gettimeofday() returning something insane, it's either the
 kernel or some weird memory corruption that's just not causing anything
 else to catch fire. If gettimeofday() is behaving, then we should suspect
 the floattime() math or the floating point operations done afterward.

 Another idea is to add code to floattime() that stringifies the float
 and compares it against NaN right away. Then run everything under gdb,
 put a breakpoint on the branch taken when the comparison matches, and use
 the tahoe node until it fails in this way. Then look at the local
 variables in the debugger and see if anything looks suspicious.
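
 The same check can be tried at the Python level first. A minimal sketch,
 assuming the platform formats NaN as some spelling that contains "nan";
 looks_like_nan and checked_time are hypothetical helpers, and the
 self-comparison is included as a second opinion that does not depend on
 string formatting.

```python
def looks_like_nan(value):
    # Check 1: stringify and look for a NaN spelling.
    by_string = "nan" in repr(value).lower()
    # Check 2: NaN is the only float that is not equal to itself.
    by_compare = value != value
    return by_string, by_compare


def checked_time(now):
    by_string, by_compare = looks_like_nan(now)
    flagged = by_string or by_compare
    if flagged:
        # This is the branch to put a breakpoint on.
        print("suspicious time value: %r" % (now,))
    return now, flagged


value, flagged = checked_time(1245617755.5)
bad_value, bad_flagged = checked_time(float("nan"))
```

 If the two checks ever disagree, that by itself would point at the
 formatting path rather than the arithmetic.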

 Boy, you know how to find the fun bugs, don't you? :-)

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/737#comment:10>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid