[tahoe-dev] Unicode issues review

Brian Warner warner-tahoe at allmydata.com
Tue Feb 17 19:29:41 UTC 2009

On Tue, 17 Feb 2009 09:15:35 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:

> The problem with that is that there isn't necessarily any one
> encoding that works.  It would be nice if all the file names in a
> file system used the same encoding, but it isn't necessarily true.

Yeah, it seems to me that we've got two basic choices:

 1: prioritize round-trip fidelity, from system A, into Tahoe, and back to
    system A. This approach would use bytestrings in Tahoe dirnodes, do a
    best-effort display of childnames in the WUI, and never use unicode.
 2: prioritize common access: use unicode in Tahoe dirnodes, try to interpret
    local disk filenames as unicode (perhaps with user assistance), find a
    way to deal with filename bytes that don't decode as unicode.
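
To make the tradeoff concrete, here is a minimal sketch. The filenames and the
try_decode helper are hypothetical, made up for illustration; nothing here is
Tahoe code. It shows two names that a user would read identically but that
sit on disk in different encodings:

```python
# Two hypothetical filenames as raw bytes, the way a POSIX filesystem stores them.
name_utf8 = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'
name_latin1 = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'

def try_decode(raw, encoding):
    """Return the decoded name, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# Approach 1 keeps the bytes, so round-tripping is trivial, but the two names
# compare unequal even though they look the same to the user.
same_bytes = name_utf8 == name_latin1

# Approach 2 needs a decode step, and no single encoding is right for both:
# strict UTF-8 rejects the Latin-1 name, while Latin-1 accepts everything
# but garbles the UTF-8 name.
as_utf8 = [try_decode(n, "utf-8") for n in (name_utf8, name_latin1)]
as_latin1 = [try_decode(n, "latin-1") for n in (name_utf8, name_latin1)]
```

Decoding everything as Latin-1 never fails, but it only hides the problem: the
UTF-8 name comes out as mojibake rather than raising an error.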

We've already decided to use unicode in Tahoe dirnodes, so I think we're
committed to something along the lines of #2.

I remember reading in the python-dev thread about this same issue
(specifically about what they should do in python-3.0) that some portion of
KDE used unicode internally for everything, but when they encountered a
high-bit filename that wasn't UTF-8, they recorded the bytes in some special
reserved portion of the unicode space, where they couldn't be usefully
rendered but were at least preserved. Then the filename could be accurately
regenerated later, even if it was basically uninterpretable garbage.

We could do something like this in Tahoe: ask the user to tell us how to
interpret local-disk filename bytestrings (maybe we'll be lucky and they'll
use the same convention on the whole disk), but if the decode fails,
translate the bytes into the unicode reserved space. On the output end
(basically 'tahoe cp'), look for these reserved characters in the tahoe name,
and translate them back into high-bit characters in the local-disk name.
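
A rough sketch of that round trip, under some loud assumptions: it uses
Python's "surrogateescape" error handler (PEP 383, Python 3.1 and later),
which maps each undecodable byte 0x80-0xFF to a lone surrogate in
U+DC80-U+DCFF. That range is Python's choice and is not necessarily the one
KDE uses. The function names and the sample filename are hypothetical, not
Tahoe APIs.

```python
def local_to_tahoe(raw_name, user_encoding="utf-8"):
    """Decode a local-disk filename using the encoding the user told us about.

    Bytes that do not decode are not an error; each one is escaped as a
    code point in U+DC80..U+DCFF so it can be recovered later.
    """
    return raw_name.decode(user_encoding, errors="surrogateescape")

def tahoe_to_local(name, user_encoding="utf-8"):
    """Encode a stored name for local disk, turning escaped code points
    back into the original high-bit bytes."""
    return name.encode(user_encoding, errors="surrogateescape")

def has_escaped_bytes(name):
    """True if the name carries bytes that could not be interpreted."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)

raw = b"caf\xe9.txt"               # Latin-1 bytes, not valid UTF-8
stored = local_to_tahoe(raw)       # the 0xE9 byte becomes U+DCE9
restored = tahoe_to_local(stored)  # identical to the original bytes
```

One caveat with this particular escape range: a string containing lone
surrogates cannot be encoded as strict UTF-8 in Python, so whatever layer
serializes the stored names would need its own handling for them.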

I have no idea what part of KDE they were referring to, nor how to find it,
but if we took this approach, it might be a good idea to use the same
reserved range that they do.

