[tahoe-dev] File naming on POSIX and Windows clients [was: PEP 383 update: ...]

Mon May 11 01:13:53 UTC 2009

On approximately 5/10/2009 1:58 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > This branch of this thread has migrated to tahoe-dev, Stephen, not 
>  > python-dev.  So you need to think about their needs if you respond here, 
>  > not the needs of Python or python-dev.
>
> Zooko asked for my comments on the protocol for translating from valid
> Unicode in Tahoe to whatever on a POSIX system, and the reverse; I
> intend to stick to that, until there's an explicit suggestion that the
> principle that "Tahoe filenames are valid Unicode" be reconsidered.
>   

Whatever.  But I fail to see how telling them that a similar problem has 
been solved elsewhere using invalid Unicode assists with that.  I think 
that keeping Tahoe filenames as valid Unicode is a good idea, but it 
does mean that the PEP 383 hack (among others) doesn't solve their problem.
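
To make that concrete, here is a minimal Python 3 sketch of what the 
PEP 383 error handler ("surrogateescape") does with an undecodable byte. 
The sample byte string is my own illustration, not taken from Tahoe. 
The decoded name contains a lone surrogate, so a strict UTF-8 encoder 
refuses it, which is why it cannot be stored as a valid Unicode name.

```python
# Illustrative input: a Latin-1 encoded name, which is not valid UTF-8.
raw = b"caf\xe9"

# PEP 383: undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")

# A lone surrogate is not valid Unicode text, so strict encoding fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the same error handler can turn the name back into the original bytes.
round_trip = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))        # 'caf\udce9'
print(strict_ok)          # False
print(round_trip == raw)  # True
```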

>  > A PU character registry would remove from Tahoe the ability 
>  > for Tahoe clients to use PU characters for their own, actual character 
>  > purposes, which may also not be acceptable.
>
> Did you read the post where I explained how this could be done in a
> way that does *not* interfere with client use of the PUA?  This use of
> the PUA would be *entirely* internal to Tahoe (including display of
> the file names), and therefore does not encroach on clients' uses.
> (OTOH, the clients can "DoS" Tahoe by using whole planes of PU
> characters in file names, but this seems kind of unlikely.)
>   

Perhaps not.  A link to it would be welcome.  I have doubts that it can 
be used to solve the general problem, without either restricting the 
user from using some of the characters, or using one as an "escape" 
character (basically in lieu of % or some other character).

>  > >  > I question how many programs, faced with apparently URL-encoded
>  > >  > filenames, actually attempt to URL-decode the name.  Most of what
>  > >  > I've seen is that the names simply linger, containing their
>  > >  > URL-encoding, and looking ugly.
>  > >
>  > > I decode such on an ad hoc basis all the time.  I suspect other users
>  > > in non-Latin locales will do so, too.
>  >
>  > So if you have an extra layer of encoding, you will either figure out 
>  > how it works, and how and when to do the appropriate decoding, or you will 
>  > do it wrong and be confused.
>
> Yes.  I think that latter case will occur frequently for the
> proposed %%/%U/%u encoding, balancing its useful features to a great
> extent.
>   

I agree that % is not the best choice of escape character, although it 
could probably be defined well enough to make it work, for people that 
want to make it work.  I think an extra layer of encoding is unavoidable.

>  > If Tahoe enforces a consistent normalization, then it would need a 
>  > scheme for dealing with the potential duplications that could result 
>  > from file systems that don't.
>
> It does, and it does.  The point of the example is that certain types
> of use cases are likely to suffer from this a lot, even if "world
> wide" it is extremely uncommon on average.
>   

Likely so.  Just like nearly any escape character that could be picked 
would suffer in certain use cases.
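
As an illustration of the duplication problem discussed above, here is 
a minimal Python sketch that groups names which become identical after 
normalization. The choice of NFC and the helper name find_collisions 
are my assumptions for the example; nothing here says which form Tahoe 
would actually enforce.

```python
import unicodedata
from collections import defaultdict


def find_collisions(names, form="NFC"):
    """Group names that collapse to the same string after normalization.

    `form` is an assumption for this sketch; any of NFC, NFD, NFKC, NFKD
    accepted by unicodedata.normalize would work the same way.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    # Keep only the normalized names that more than one original maps to.
    return {norm: originals for norm, originals in groups.items()
            if len(originals) > 1}


# Two spellings of the same visible name: precomposed vs. combining accent.
listing = ["caf\u00e9.txt", "cafe\u0301.txt", "notes.txt"]
collisions = find_collisions(listing)
print(len(collisions))  # 1
```

A file system that does not normalize can hold both spellings in one 
directory, so whichever side enforces normalization needs a rule for 
what to do with the second one.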

>  > The solutions for Rock Ridge and Joliet each seem to depend on the 
>  > flexibility of the original ISO 9660 system having an "escape" system to 
>  > allow alternate names, and each defines a rigid way of using those 
>  > alternate names.
>  > 
>  > Unfortunately, none of the file systems we are talking about do that.  
>  > Except, Tahoe _could_.
>
> In fact Tahoe can do it both internally (by adding metadata) and
> externally (by convention, eg. creating a file named TRANS.TBL in the
> same directory which maps Unicode names to original bytes).  External
> conventions are not terribly reliable, but might work in enough cases.
>   

Yes, I hadn't thought about the external metadata case, but that could 
be done, in effect forcing each system with non-Unicode names to do its 
own bookkeeping for such names, and not burdening the Tahoe server with 
such names.  External metadata can get stale, if the files are deleted 
by another system, but that is not necessarily an insurmountable problem.
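
A rough Python sketch of such external bookkeeping follows. The table 
layout (one line per entry, hex of the original bytes, a tab, then the 
Unicode name) and the function names are entirely hypothetical; this is 
not the real TRANS.TBL format and not anything Tahoe defines. It also 
assumes names contain no newline characters.

```python
def dump_table(mapping):
    """Serialize {unicode_name: original_bytes} as 'hex<TAB>name' lines."""
    lines = ["%s\t%s" % (raw.hex(), name)
             for name, raw in sorted(mapping.items())]
    return "\n".join(lines) + "\n"


def parse_table(text):
    """Inverse of dump_table; blank lines are skipped."""
    table = {}
    for line in text.split("\n"):
        if not line:
            continue
        # Split on the first tab only, so a tab inside the name survives.
        hex_bytes, name = line.split("\t", 1)
        table[name] = bytes.fromhex(hex_bytes)
    return table


def prune_stale(table, present_names):
    """Drop entries whose file no longer exists (e.g. deleted elsewhere)."""
    return {name: raw for name, raw in table.items() if name in present_names}


mapping = {"^^caf\u01e9": b"caf\xe9", "readme": b"readme"}
restored = parse_table(dump_table(mapping))
print(restored == mapping)  # True
```

Pruning against the current directory listing is one way to cope with 
entries going stale; it does nothing for a file that another system 
renamed rather than deleted.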

>  > Remember that the %% and %u encoding proposal that we are
>  > responding to is intended to avoid the idea of fragile metadata
>  > that could get lost;
>
> The problem with the encoding proposal is that we already *have* a
> universal encoding, and it's called "Unicode".  If Unicode is not
> going to work, inventing a new universal encoding is unlikely to work
> very well either.  The best bet is to keep any complexity (such as a
> PU character registry) entirely internal to Tahoe, while making the
> external interface as simple and unambiguous as possible.
>   

The PU character registry reference is lost on me, until you provide the 
URL that describes that scheme.

The sad part is that Unicode probably _could_ work, if the correct 
encoding were retained with POSIX bytes filenames, so that the correct 
decoding could be applied.  It seems that if POSIX filesystems were 
configured or enhanced to capture that information, file name 
manipulations would be much easier.  While POSIX doesn't specify doing 
so, it seems that Mac OS X _is_ doing so, in the sense of restricting 
names to UTF-8 encoding?  And it certainly seems possible to capture 
that information at the time of file creation, but file utilities would 
also have to be enhanced to preserve it during copies of the file.

Without knowing the correct encoding, the result is, unfortunately, 
mojibake, and no additional encoding solution will make that clearer.

Since it is mojibake anyway, one could use a mojibake encoding algorithm 
such as

1) If the name decodes to Unicode successfully using the current file 
system encoding, use that name.

2) Otherwise, obtain the bytes, and create a Unicode name that starts 
with ^^ and is followed by one codepoint per byte, where the codepoint 
is calculated from each byte value as

(bytevalue < 128 and bytevalue is legal in Windows filenames) ? 
bytevalue : bytevalue + 256

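It would be extremely simple to code, encode, and decode, and would have 
only displayable characters, since every escaped byte lands in the range 
U+0100 through U+01FF (Latin Extended-A and Latin Extended-B).  A lookup 
table would work well in both directions.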
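
The two steps above can be sketched in Python 3 roughly as follows. 
This is only a sketch under my own assumptions: the function names are 
made up, "legal in Windows filenames" is taken to mean printable ASCII 
other than < > : " / \ | ? *, and the file system encoding defaults to 
UTF-8. A name that decodes cleanly but really does begin with ^^ would 
be misread on the way back; the sketch does not try to resolve that.

```python
PREFIX = "^^"
# Assumption: characters Windows forbids in file names (besides controls).
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def is_plain(byte):
    """True if the byte may stand for itself in the encoded name."""
    return 32 <= byte < 127 and chr(byte) not in WINDOWS_FORBIDDEN


def encode_name(raw, fs_encoding="utf-8"):
    """Step 1: try the file system encoding. Step 2: escape byte by byte."""
    try:
        return raw.decode(fs_encoding)
    except UnicodeDecodeError:
        pass
    return PREFIX + "".join(
        chr(b) if is_plain(b) else chr(b + 256) for b in raw
    )


def decode_name(name, fs_encoding="utf-8"):
    """Invert encode_name; raises ValueError on codepoints above U+01FF."""
    if not name.startswith(PREFIX):
        return name.encode(fs_encoding)
    out = bytearray()
    for ch in name[len(PREFIX):]:
        cp = ord(ch)
        # Escaped bytes are >= 256; plain bytes are below 128.
        out.append(cp - 256 if cp >= 256 else cp)
    return bytes(out)


latin1_name = b"caf\xe9.txt"          # not valid UTF-8
encoded = encode_name(latin1_name)
print(ascii(encoded))                 # '^^caf\u01e9.txt'
print(decode_name(encoded) == latin1_name)  # True
```

Because plain bytes stay below 128 and escaped bytes start at 256, the 
two ranges never overlap, which is what makes the decode unambiguous 
once the prefix has been seen.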

A variation might notice any . characters in the name, and encode/decode 
each part of the name between . characters independently.  This might 
help preserve file extensions that might be in ASCII.
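
A self-contained Python 3 sketch of that per-part variation, again with 
made-up function names and the same assumptions (UTF-8 as the file 
system encoding, printable ASCII minus < > : " / \ | ? * as the bytes 
that stand for themselves):

```python
PREFIX = "^^"
FORBIDDEN = set('<>:"/\\|?*')  # assumption: Windows-illegal characters


def encode_part(raw, fs_encoding="utf-8"):
    """Encode one dot-free segment: decode if possible, else escape bytes."""
    try:
        return raw.decode(fs_encoding)
    except UnicodeDecodeError:
        return PREFIX + "".join(
            chr(b) if 32 <= b < 127 and chr(b) not in FORBIDDEN
            else chr(b + 256)
            for b in raw
        )


def decode_part(part, fs_encoding="utf-8"):
    """Invert encode_part for one segment."""
    if not part.startswith(PREFIX):
        return part.encode(fs_encoding)
    return bytes(
        ord(c) - 256 if ord(c) >= 256 else ord(c)
        for c in part[len(PREFIX):]
    )


def encode_dotted(raw, fs_encoding="utf-8"):
    """Split on '.' bytes and encode each segment independently."""
    return ".".join(encode_part(p, fs_encoding) for p in raw.split(b"."))


def decode_dotted(name, fs_encoding="utf-8"):
    """Split on '.' and decode each segment independently."""
    return b".".join(decode_part(p, fs_encoding) for p in name.split("."))


mixed = b"caf\xc3\xa9.not\xe9"   # first part valid UTF-8, second part not
encoded = encode_dotted(mixed)
print(ascii(encoded))            # 'caf\xe9.^^not\u01e9'
print(decode_dotted(encoded) == mixed)  # True
```

Here only the undecodable segment gets the ^^ treatment, so the part 
that was valid in the file system encoding stays readable.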

(In the above, I chose ^^ as an encoding indicator prefix, thinking that 
it is even rarer at the beginning of filenames than %%, but again, it 
depends on the use case and type of names in any particular environment.)

> Note that "ambiguity" is not entirely determined by the quality of
> your algorithms, but also by the kinds of encoding that are used in
> the environment.
>   

Yep.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking