[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

tahoe-lafs trac at allmydata.org
Wed Apr 8 23:26:31 UTC 2009

#534: "tahoe cp" command encoding issue
     Reporter:  francois           |       Owner:  francois                          
         Type:  defect             |      Status:  assigned                          
     Priority:  minor              |   Milestone:  1.5.0                             
    Component:  code-frontend-cli  |     Version:  1.2.0                             
   Resolution:                     |    Keywords:  cp encoding unicode filename utf-8
Launchpad_bug:                     |  

Comment(by zooko):

 I'm reviewing your most recent patch, François.

 I'll be posting my observations in separate comments as I understand more
 of the patch.

 Here's the first observation:

 The patch seems to assume that the terminal handles either {{{ascii}}} or
 {{{utf-8}}} on stdout, but what about terminals that handle a different
 encoding, such as Windows {{{cmd.exe}}} (which presumably handles whatever
 the current Windows codepage is, or else {{{utf-16le}}})?  Apparently
 {{{sys.stdout.encoding}}} will tell us what python thinks it should use if
 you pass a unicode string to it with {{{print myunicstr}}} or
 {{{sys.stdout.write(myunicstr)}}}.

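 To illustrate (a minimal sketch, not code from the patch):
 {{{sys.stdout.encoding}}} is derived from the terminal/locale when stdout
 is a tty, and can be {{{None}}} when stdout is redirected to a pipe or
 file.

```python
# Sketch: inspect what encoding Python would use for unicode output.
# sys.stdout.encoding comes from the terminal/locale when stdout is a
# tty; it can be None (notably on Python 2) when stdout is redirected.
import sys

enc = sys.stdout.encoding
print("stdout encoding: %r" % (enc,))
```
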
 In any case the documentation should explain this -- that what you see
 when you run {{{tahoe ls}}} will depend on the configuration of your
 terminal.  Hm, this also suggests that it isn't correct for tahoe to have
 a {{{unicode_to_stdout()}}} function and instead we should just rely on
 the python {{{sys.stdout}}} encoding behavior.  What do you think?
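 For concreteness (a sketch, not code from the patch): when stdout's
 encoding is {{{ascii}}}, handing python a unicode string makes it attempt
 an ascii encode, which fails for non-ascii filenames:

```python
# Sketch: what an ascii-configured stdout would attempt when handed a
# unicode filename -- the encode step raises rather than guessing.
name = u"Fran\u00e7ois"  # a filename with a non-ascii character

try:
    name.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

print("ascii encode failed: %s" % failed)
```
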

 I guess one place where I would be willing to second-guess python on this
 is: if {{{sys.stdout.encoding}}} says the encoding is {{{ascii}}}, or
 says that it doesn't know what the encoding is, then pre-encode your
 unicode strings with {{{utf-8}}} (or, if on Windows, with {{{utf-16le}}})
 before printing them or {{{sys.stdout.write()}}}'ing them.  This is for
 the following reasons:

 1.  A misconfigured environment will result in python defaulting to
 {{{ascii}}} when {{{utf-8}}} will actually work better.  (I just now
 discovered that my own Mac laptop on which I am writing this was so
 misconfigured, and when I tried to fix it I then misconfigured it in a
 different way that had the same result!  The first was: {{{LANG}}} and
 {{{LC_ALL}}} were being cleared out in my {{{.bash_profile}}}; the second
 was: I set {{{LANG}}} and {{{LC_ALL}}} to {{{en_DK.UTF-8}}}, but this
 version of Mac doesn't support that locale, so I had to change it to one
 that it does support.)
 2.  Terminals that actually can't handle {{{utf-8}}} and can only handle
 {{{ascii}}} are increasingly rare.

 3.  If there __is__ something that can handle only {{{ascii}}} and you
 give it {{{utf-8}}}, you'll be emitting garbage instead of raising an
 exception, which might be better in some cases.  On the other hand I
 suppose it could be worse in others.  (Especially when it happens to
 produce control characters and screw up your terminal emulator...)
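
 The fallback strategy described above could be sketched like this (a
 hypothetical helper, not from the patch; the name
 {{{encode_for_stdout}}} is made up for illustration):

```python
# Hypothetical sketch of the fallback described above: trust
# sys.stdout.encoding unless it is missing or plain ascii, in which
# case fall back to utf-8 (or utf-16le on Windows).
import sys

def encode_for_stdout(u, enc=None):
    """Encode a unicode string for writing to stdout."""
    if enc is None:
        enc = getattr(sys.stdout, "encoding", None)
    if enc is None or enc.lower() == "ascii":
        enc = "utf-16le" if sys.platform == "win32" else "utf-8"
    return u.encode(enc)

print(encode_for_stdout(u"Fran\u00e7ois", enc="ascii"))
```

 Under an ascii-only (or unknown) stdout this emits utf-8 bytes instead
 of raising, with the garbage-vs-exception tradeoff noted in point 3.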

 I'm not entirely sure that this second-guessing of python is really going
 to yield better results more often than it yields worse results, and it is
 certainly more code, so I would also be happy with just emitting unicode
 objects to stdout and letting python and the local system config do the
 work from there.

 Small details, English spelling, and editing:
 s/Tahoe v1.3.1/Tahoe v1.5.0/

Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:51>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

More information about the tahoe-dev mailing list