[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler

Fri May 8 01:55:30 UTC 2009

Glenn Linderman wrote:
> On approximately 5/7/2009 8:40 AM, came the following characters from 
> the keyboard of Zooko O'Whielacronx:
>> Dear Glenn Linderman and SJT:
>>
>> You two encoding experts who have volunteered some ideas for Tahoe
>> might also be interested in this post that David-Sarah Hopwood just
>> sent:
>>
>> http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
> 
> Regarding this proposal, I would assume (but the proposal should 
> clarify) that the proposal is looking at a filename, not a pathname, and 
> that each directory name in a path name would be independently processed 
> by the algorithms in this proposal.

Yes.

> The proposal has a lot of merit; it avoids the use of meta-data that, as 
> I pointed out in yesterday's comments, could get lost by transitions 
> between filesystems.
> 
> Whereas my comments yesterday suggested a directory into which 
> transcoded files could be placed, and that that was problematic (for the 
> unstated reason of separating files into two buckets), this proposal 
> suggests reserving the %% and %u and %U file prefixes for transcoded 
> files.  While it keeps the files in the same buckets (directories) which 
> is good, it raises the question of whether the prefix(es) is/are unique 
> enough to mostly avoid problems with name collisions.

True. However, if the representation of an incorrectly decoded filename
is not an invalid string, then it must necessarily step on some subset
of valid strings. It would be possible to use non-NFC strings or
strings containing Unicode noncharacters, but since those aren't
reliably representable as filenames in non-Tahoe filesystems, that
wouldn't satisfy the goal of allowing lossless transitions between
filesystems.

(Also, using noncharacters is strictly speaking not compliant with
the Unicode standard -- Unicode APIs are permitted to strip them or
treat them as an error.)

> If some prefix can be thought to be rare enough to avoid problematical 
> collisions, I would think it should be used consistently, just one 
> prefix, rather than 3 prefixes, which triple the chances for collisions.

The choice of prefixes is a minor detail, I think. The constraints on
the prefixes for the Unicode and byte-oriented encodings are:

 - they can be distinguished from each other;
 - they are printable ASCII, and representable in all common filesystems;
 - they are sufficiently rare at the start of real filenames;
 - if they contain cased characters, those characters are treated
   as case-insensitive;
 - they are not possible prefixes of reserved filenames.

> Seems like the distinction between 4-digit and 6-digit Unicode %U 
> encodings is the + after the %.

Yes. An alternative here is to use %HHHH%HHHH (or @HHHH@HHHH) where
each 4-digit hex value represents a UTF-16 code unit. Just using %HHHHHH
would obviously be ambiguous.
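
To illustrate the UTF-16 alternative, here is a minimal Python sketch.
The function name escape_code_point and its esc parameter are
hypothetical, not part of the proposal. It writes one character as one
or two 4-digit hex UTF-16 code units, so a character outside the BMP
becomes a surrogate pair.

```python
def escape_code_point(ch: str, esc: str = "%") -> str:
    """Escape one character as 4-hex-digit UTF-16 code units (sketch)."""
    data = ch.encode("utf-16-be")  # 2 bytes per code unit, no BOM
    units = [int.from_bytes(data[i:i + 2], "big")
             for i in range(0, len(data), 2)]
    return "".join(f"{esc}{unit:04X}" for unit in units)


print(escape_code_point(":"))            # %003A
print(escape_code_point("\U0001F600"))   # %D83D%DE00 (surrogate pair)
print(escape_code_point("@", esc="@"))   # @0040
```

Because every escape is exactly four hex digits, a decoder never has to
guess where an escape ends, which is the ambiguity a bare %HHHHHH form
would introduce.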

> The comment that % need not be escaped from shell commands in any common 
> operating system makes me wonder if the author has ever heard of 
> Microsoft Windows, or has tried to access a file named
> 
> %%my%dear%faraway%Abby.doc
> 
> from a Windows command shell that has environment variables named "my", 
> "dear", and "faraway" defined.

Oops. I use cygwin on Windows; I had forgotten about the environment
variable convention in the cmd.exe shell. There are other characters,
such as '@', that could be used instead, and the rest of the proposal
is independent of which escape character is used.

> The definitions of %% and %u encodings do not mention escaping the 
> escape character.

Yes, I had considered that but just forgot to mention it. Mea culpa.

[...]
> Any such escaping scheme could run into length limits on the names;
> some discussion of that issue should be included in such proposals.

This was mentioned in my proposal:

# - whenever a Tahoe filename is converted to a name for a
#   particular filesystem, if the result is too long for
#   that filesystem, then fail the operation.

There is little else that can be done: as you say, *any* escaping scheme
(including one using UTF-8B or private-use characters) might run into a
length limit for a particular filesystem. A filename that has no need
for escaping could also run into a length limit shorter than that of
Tahoe.
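
The "fail the operation" rule can be sketched as follows. This is only
an illustration: the names FilenameTooLong and check_length are
hypothetical, and the default limit of 255 bytes in UTF-8 is an
assumption about the target filesystem, not something the proposal
specifies.

```python
class FilenameTooLong(ValueError):
    """Raised when a converted name exceeds the target filesystem's limit."""


def check_length(converted: str, max_bytes: int = 255,
                 encoding: str = "utf-8") -> str:
    """Return the converted name unchanged, or fail if it is too long."""
    size = len(converted.encode(encoding))
    if size > max_bytes:
        # Assumed policy from the proposal: fail rather than truncate.
        raise FilenameTooLong(f"{size} bytes exceeds limit of {max_bytes}")
    return converted


check_length("a" * 255)                  # fits, returned unchanged
try:
    check_length("%%" + "%3A" * 90)      # escaping tripled the length
except FilenameTooLong as err:
    print("conversion failed:", err)
```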

> The description of %% encoding seems unusual... there are no bytes that 
> do not correspond to ISO Latin-1 characters, except possibly for control 
> characters between 1 and 31 inclusive, if they are outlawed in Tahoe 
> file names (are they?  Need they be?).

"ISO-Latin-1 characters" was intended to mean Unicode characters
U+0000..U+00FF inclusive.

> So it seems that %% encoding 
> would only add a %% in front, and then be mojibake, if the byte encoding 
> was not originally ISO Latin-1.

That's not correct; canonical %%-encoding only generates filenames
containing POSIX-portable characters plus the escape character.

The conversions mention ISO-Latin-1 only because it is possible to
construct Tahoe filenames that start with %%, but contain Unicode
characters above U+00FF (and therefore are not a %%-encoding at all,
never mind a canonical one). Since the conversion functions from a
Tahoe filename to a Unicode or byte-oriented filename are intended
to be total provided that no length constraint is hit, they must
specify what to do in this case.

It would probably have been clearer, however, just to say
"is not a %%-encoding" rather than mentioning ISO-Latin-1. That
would also cover the case of filenames that start with "%%" but
contain an escape character not followed by two hex digits.
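
A rough Python sketch of such an "is a %%-encoding" test, covering only
the two failure cases named above. The function name is hypothetical,
and accepting lowercase hex digits is an assumption.

```python
import re

# After the "%%" prefix: any non-escape character, or '%' plus two hex digits.
_BODY = re.compile(r"(?:[^%]|%[0-9A-Fa-f]{2})*")


def is_percent_percent_encoding(name: str) -> bool:
    """Sketch: does `name` have the shape of a %%-encoding?"""
    if not name.startswith("%%"):
        return False
    body = name[2:]
    if any(ord(ch) > 0xFF for ch in body):
        return False  # contains a character above U+00FF
    return _BODY.fullmatch(body) is not None


print(is_percent_percent_encoding("%%x%3Ay"))    # True
print(is_percent_percent_encoding("%%x%3"))      # False: truncated escape
print(is_percent_percent_encoding("%%\u4e2d"))   # False: above U+00FF
```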

Of course mojibake is still possible if a filename is accidentally
decoded using the wrong encoding, but there is not much that can be
done about that.

> The comment that "The %% and %U encodings are never mixed" seems 
> impossible.

The comment is correct. The conversions only generate canonical
%%-encodings; since those

 - only contain POSIX portable characters plus the escape character;
 - start with a prefix that excludes them from being reserved filenames;

they should be representable on all filesystems, and so it is
unnecessary to further %U-encode them. (Note that this does not mean
that a filename that starts with "%%" cannot be %U-encoded. But any
given filename cannot be both a %% and a %U-encoding.)

> I posit a POSIX file name with a non-decodable sequence in 
> its original encoding; this forces %% encoding inside Tahoe.  If such a 
> name contains a ":", then when a Windows system wants to access the 
> file, it must be %U encoded.  How is the mixture avoided?
> There is no description of how to handle this case.

This case will not occur for %%-encodings generated by Tahoe,
because ':' is not a portable POSIX filename character, and so it
would be represented as %3A in the canonical %%-encoding.
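
As an illustration, a canonical %%-encoder could look roughly like the
sketch below. The function name and the choice of uppercase hex digits
are assumptions. The portable set is the POSIX portable filename
character set (A-Z, a-z, 0-9, period, underscore, hyphen).

```python
# POSIX portable filename character set, as byte values.
PORTABLE_BYTES = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789._-"
)


def percent_percent_encode(raw: bytes, esc: str = "%") -> str:
    """Sketch: encode an undecodable byte-string filename."""
    out = [esc * 2]  # the "%%" prefix
    for byte in raw:
        if byte in PORTABLE_BYTES:
            out.append(chr(byte))
        else:
            # Covers ':' and the escape character itself, among others.
            out.append(f"{esc}{byte:02X}")
    return "".join(out)


print(percent_percent_encode(b"x:y"))        # %%x%3Ay
print(percent_percent_encode(b"caf\xe9%"))   # %%caf%E9%25
```

Since the output contains only portable characters plus the escape
character, it never needs a second round of escaping for Windows.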

It is possible for a filename such as "%%x:y" to occur other
than by canonical %%-encoding, and that case was handled as intended
in the description of Tahoe -> Unicode conversion. The corresponding
Windows filename would be "%U%0025%0025x%003Ay".

(If the escape character is changed to '@', then the equivalent example
is that the Windows filename for "@@x:y" would be "@U@0040@0040x@003Ay".)
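
The two examples above can be reproduced with the following sketch. It
assumes, purely for illustration, that only POSIX-portable characters
are left unescaped; the actual proposal may leave more characters
alone. The function name u_encode is hypothetical, and the 6-digit
form with '+' follows the 4-digit/6-digit distinction discussed
earlier in this message.

```python
PORTABLE = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789._-"
)


def u_encode(name: str, esc: str = "%") -> str:
    """Sketch: %U-encode a Unicode filename for a restrictive filesystem."""
    out = [esc + "U"]
    for ch in name:
        cp = ord(ch)
        if ch in PORTABLE:
            out.append(ch)                   # assumed safe everywhere
        elif cp <= 0xFFFF:
            out.append(f"{esc}{cp:04X}")     # 4-digit form
        else:
            out.append(f"{esc}+{cp:06X}")    # 6-digit form, marked by '+'
    return "".join(out)


print(u_encode("%%x:y"))             # %U%0025%0025x%003Ay
print(u_encode("@@x:y", esc="@"))    # @U@0040@0040x@003Ay
```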

> I think a scheme along these lines is workable, though, but some 
> refinements will be needed, and sufficient use cases provided to help 
> explain how the various schemes work together, once they are refined, 
> and if they do work together.

I agree; the description needs some work, but I believe the proposal
is technically sound.

-- 
David-Sarah Hopwood ⚥