[tahoe-dev] String encoding in tahoe

Francois Deppierraz francois at ctrlaltdel.ch
Mon Dec 22 16:03:34 UTC 2008


Hi Zooko,

I finally found enough time today to investigate this issue further.
Basically, test_unicode_filename exposes strings which are not being
converted as expected.

As Brian pointed out in [1], the current codebase is calling
simplejson.dumps with bytestrings coming from the command line. This
might sometimes work but is definitely not recommended. The same kind of
issue appears with UTF-8 filenames in the FTP and SFTP servers.
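
A minimal sketch of the safer pattern, under a few assumptions: the
standard-library json module stands in for simplejson (both expose
dumps/loads), the code uses the Python 3 spelling where bytes and str
are distinct types, and the filename is made up for illustration.

```python
import json

# Hypothetical filename as it would arrive from the command line: UTF-8 bytes.
raw_name = "Mot\u00f6rhead.txt".encode("utf-8")

# Decode once, at the boundary, so the serializer sees text rather than bytes.
name = raw_name.decode("utf-8")

encoded = json.dumps({"name": name})
decoded = json.loads(encoded)

# The text round-trips unchanged through JSON.
assert decoded["name"] == name
```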

We usually have UTF-8 bytestrings as input (sys.argv, filenames,
aliases, etc.) and need UTF-8 bytestrings as output (URLs, filenames,
etc.). However, it is usually simpler and safer to use unicode strings
internally.

Kumar McMillan gives the following advice in his talk [2]:

   1. Decode early
   2. Unicode everywhere
   3. Encode late

and to create wrappers for libraries which are not unicode compliant
(urllib for example).
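
The three rules plus a wrapper could look roughly like the sketch below.
The helper names decode_args and quote_text are hypothetical, not
existing tahoe functions, and the code uses the Python 3 spelling
(urllib.parse.quote; at the time of writing this would be urllib.quote).

```python
from urllib.parse import quote

def decode_args(raw_args, encoding="utf-8"):
    """Decode early: turn bytestring arguments into text at the boundary."""
    return [a.decode(encoding) if isinstance(a, bytes) else a
            for a in raw_args]

def quote_text(text, encoding="utf-8"):
    """Encode late: hand explicitly encoded bytes to the library call."""
    return quote(text.encode(encoding), safe="")

# Hypothetical command line, received as UTF-8 bytestrings.
args = decode_args([b"put", "caf\u00e9.txt".encode("utf-8")])

# Everything in between works on text; bytes reappear only in the URL.
url_part = quote_text(args[1])
```

Keeping the encode step inside one wrapper means the rest of the code
never has to know which encoding the library expects.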

Does this sound coherent in the context of tahoe? If so, the question is
where the best places to handle these conversions are.

Should we (1) automatically convert sys.argv[] from bytestring to
unicode in runner.runner(), or (2) do it selectively for each command
(put, cp, etc.)?

I tried (1), see patch [3], which did fix the test failure on slave3
(dapper box). However, it broke many other tests at the same time,
mostly assertions in util/base32.py which seem to require bytestrings
instead of unicode strings.
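
To illustrate the trade-off between (1) and (2), here is a rough sketch.
b32_bytes_only is a hypothetical stand-in for a helper that asserts it
receives a bytestring (it is not the actual util/base32.py code), and
decode_selected and the sample arguments are likewise made up.

```python
import base64

def b32_bytes_only(data):
    # Hypothetical stand-in for a helper that asserts it is given a bytestring.
    assert isinstance(data, bytes), "expected a bytestring"
    return base64.b32encode(data)

def decode_selected(argv, text_positions):
    # Option (2): decode only the arguments known to be human-readable text.
    return [arg.decode("utf-8") if i in text_positions else arg
            for i, arg in enumerate(argv)]

# Hypothetical arguments: a filename (text) followed by an opaque byte value.
argv = ["r\u00e9sum\u00e9.txt".encode("utf-8"), b"\x00\x01\x02"]

mixed = decode_selected(argv, text_positions={0})
encoded = b32_bytes_only(mixed[1])  # still bytes, so the assertion holds

# Option (1), decoding everything, hands text to the byte-only helper.
everything = [arg.decode("utf-8") for arg in argv]
try:
    b32_bytes_only(everything[1])
    blanket_ok = True
except AssertionError:
    blanket_ok = False
```

Selective decoding keeps byte-only code paths working, at the cost of
each command having to know which of its arguments are text.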

François

[1] http://allmydata.org/trac/tahoe/ticket/534#comment:31
[2] http://farmdev.com/talks/unicode/

[3]
--- old-tahoe/src/allmydata/scripts/runner.py   2008-12-22 07:33:51.000000000 -0800
+++ new-tahoe/src/allmydata/scripts/runner.py   2008-12-22 07:33:52.000000000 -0800
@@ -33,6 +33,12 @@
            stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr,
            install_node_control=True, additional_commands=None):

+    # Convert arguments to unicode
+    new_argv = []
+    for arg in argv:
+      new_argv.append(arg.decode('utf-8'))
+    argv = new_argv
+
     config = Options()
     if install_node_control:
         config.subCommands.extend(startstop_node.subCommands)


