[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

Tue Apr 28 16:08:48 UTC 2009

#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
     Reporter:  francois           |       Owner:  francois                          
         Type:  defect             |      Status:  assigned                          
     Priority:  minor              |   Milestone:  1.5.0                             
    Component:  code-frontend-cli  |     Version:  1.2.0                             
   Resolution:                     |    Keywords:  cp encoding unicode filename utf-8
Launchpad_bug:                     |  
-----------------------------------+----------------------------------------

Comment(by zooko):

 Hm, so there is this idea by Markus Kuhn called {{{utf-8b}}}.
 {{{utf-8b}}} decoding is just like {{{utf-8}}} decoding, except that if the
 input string turns out not to be valid {{{utf-8}}} encoding, then
 {{{utf-8b}}} stores the invalid bytes of the string as invalid code points
 in the resulting unicode object.  This means that
 {{{utf8b_encode(utf8b_decode(x)) == x}}} for any {{{x}}} (not just for
 {{{x}}}'s which are {{{utf-8}}}-encodings of a unicode string).
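
 A minimal sketch of that round-trip property.  It assumes Python 3, where
 the PEP 383 behaviour is available as the {{{surrogateescape}}} error
 handler; the names {{{utf8b_decode}}} and {{{utf8b_encode}}} are just
 illustrative wrappers around it, not an existing codec module.

```python
def utf8b_decode(raw: bytes) -> str:
    # Assumption: "surrogateescape" stands in for utf-8b decoding.
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    return raw.decode("utf-8", "surrogateescape")


def utf8b_encode(text: str) -> bytes:
    # Inverse operation: lone surrogates U+DC80..U+DCFF turn back into bytes.
    return text.encode("utf-8", "surrogateescape")


valid = "caf\u00e9".encode("utf-8")   # well-formed utf-8
invalid = b"caf\xe9"                  # latin-1 bytes, not valid utf-8

# The round trip holds for both inputs.
assert utf8b_encode(utf8b_decode(valid)) == valid
assert utf8b_encode(utf8b_decode(invalid)) == invalid

# The offending byte 0xE9 is carried as the invalid code point U+DCE9.
assert utf8b_decode(invalid) == "caf\udce9"
```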

 I wonder if {{{utf-8b}}} provides a simpler/cleaner way to accomplish the
 above.  It would look like this.  Take the design written in
 http://allmydata.org/trac/tahoe/ticket/534#comment:47 and change step 2 to
 be like this:

 2. On Linux read the filename with the string APIs to get "bytes" and call
 {{{sys.getfilesystemencoding()}}} to get "alleged_encoding".  If the
 alleged encoding is {{{ascii}}} or {{{utf-8}}}, or if it is absent or
 invalid or denotes a codec that we don't have an implementation for, then
 set {{{alleged_encoding = 'utf-8b'}}} instead. Then, call
 {{{bytes.decode(alleged_encoding, 'strict')}}} to try to get a unicode
 object.

 2.a. If this decoding succeeds then normalize the unicode filename with
 {{{filename = unicodedata.normalize('NFC', filename)}}}, store the
 resulting filename and if the encoding that was used was ''not''
 {{{utf-8b}}} then store the alleged_encoding.  (If the encoding that was
 used was {{{utf-8b}}}, then don't store the alleged_encoding --
 {{{utf-8b}}} is the default and we can save space by omitting it.)

 2.b. If this decoding fails, then we decode it with
 {{{bytes.decode('utf-8b')}}}. Do not normalize it. Put the resulting
 unicode object into the "filename" part.  Do not store an
 "alleged_encoding".
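
 Steps 2, 2.a and 2.b could be sketched roughly as follows.  This is only
 an illustration under assumptions: it targets Python 3, uses the
 {{{surrogateescape}}} error handler as a stand-in for a {{{utf-8b}}}
 codec, and the names {{{UTF8B}}}, {{{pick_encoding}}} and
 {{{decode_filename}}} are hypothetical, not part of Tahoe.

```python
import codecs
import unicodedata

UTF8B = "utf-8b"  # hypothetical label; not a registered Python codec name


def utf8b_decode(raw):
    # Assumption: surrogateescape stands in for utf-8b decoding.
    return raw.decode("utf-8", "surrogateescape")


def pick_encoding(alleged):
    """Step 2: fall back to utf-8b for ascii, utf-8, absent or unknown codecs."""
    if not alleged:
        return UTF8B
    try:
        name = codecs.lookup(alleged).name  # canonical codec name
    except LookupError:
        return UTF8B
    if name in ("ascii", "utf-8"):
        return UTF8B
    return name


def decode_filename(raw, alleged):
    """Return (filename, stored_encoding); stored_encoding is None when omitted."""
    encoding = pick_encoding(alleged)
    if encoding == UTF8B:
        # utf-8b decoding cannot fail, so this is case 2.a with the
        # default encoding: normalize, and omit the alleged_encoding.
        return unicodedata.normalize("NFC", utf8b_decode(raw)), None
    try:
        text = raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        # 2.b: keep the raw bytes recoverable, no normalization,
        # no alleged_encoding.
        return utf8b_decode(raw), None
    # 2.a with a non-default encoding: normalize and remember the encoding.
    return unicodedata.normalize("NFC", text), encoding
```

 A caller would pass the bytes read from the directory listing together
 with {{{sys.getfilesystemencoding()}}} as {{{alleged}}}.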

 Using {{{utf-8b}}} to store bytes from a failed decoding instead of
 {{{iso-8859-1}}} means that if the name or part of the name is actually
 {{{ascii}}} or {{{utf-8}}}, then it will be (at least partially) legible.
 It also means that we can omit the "failed_decode" flag, because it makes
 no difference whether the filename was originally alleged to be in
 {{{koi8-r}}}, but failed to decode using the {{{koi8-r}}} codec, and so
 was instead decoded using {{{utf-8b}}}, or whether the filename was
 originally alleged to be in {{{ascii}}} or {{{utf-8}}}, and was decoded
 using {{{utf-8b}}}.  (Right?  I think that's right.)
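
 To illustrate the legibility point, here is a small comparison on a
 made-up filename that is valid {{{utf-8}}} except for one trailing byte.
 As before this assumes Python 3, with {{{surrogateescape}}} standing in
 for {{{utf-8b}}}; the sample bytes are invented for the example.

```python
def utf8b_decode(raw):
    # Assumption: surrogateescape stands in for utf-8b decoding.
    return raw.decode("utf-8", "surrogateescape")


# Mostly valid utf-8 ("naive" with a diaeresis), plus one stray byte 0xFF.
raw = "na\u00efve-".encode("utf-8") + b"\xff"

as_latin1 = raw.decode("iso-8859-1")
as_utf8b = utf8b_decode(raw)

# iso-8859-1 maps every byte to its own character, so the two-byte utf-8
# sequence for U+00EF shows up as two unrelated characters.
# utf-8b keeps U+00EF intact and escapes only the stray byte.
# ascii() is used because lone surrogates cannot be printed directly.
print(ascii(as_latin1))
print(ascii(as_utf8b))

# Both decodings are lossless; only the legibility differs.
assert as_latin1.encode("iso-8859-1") == raw
assert as_utf8b.encode("utf-8", "surrogateescape") == raw
```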

 An implementation, including a Python codec module, by Eric S. Tiedemann
 (1966-2008; I miss him):
 http://hyperreal.org/~est/utf-8b

 An implementation for GNU iconv by Ben Sittler:
 http://bsittler.livejournal.com/10381.html

 A PEP by Martin v. Löwis to automatically use {{{utf-8b}}} whenever you
 would otherwise use {{{utf-8}}}:
 http://www.python.org/dev/peps/pep-0383

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:58>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid