[FE-discuss] UnicodeString encoding

Discussion:

Christoph Zwerschke

2009-03-12 13:05:49 UTC

FormEncode assumes 'utf-8' encoding for the UnicodeString validator.
This is usually ok, and you can even override this using the
inputEncoding and outputEncoding settings. However, what you can *not*
do is have *no* input or output encoding at all, i.e. you can't use
Unicode in the "outside world" as well. But this is necessary, for
instance, if you're using FormEncode with ToscaWidgets forms, because
most templating languages expect Unicode objects instead of encoded
strings. That's why ToscaWidgets comes with its own modified
UnicodeString validator. It would be nice to have this feature in
FormEncode itself.

My suggestion is to allow setting inputEncoding and outputEncoding to
None. Currently, this will use the default encoding utf-8. I suggest not
decoding/encoding at all in this case, i.e. using unicode. If you don't
explicitly specify any inputEncoding or outputEncoding, the default
utf-8 encoding will be used as before. I have already created a patch
for this feature. Can I check this in to the trunk?
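
The proposed semantics can be sketched as follows. This is only an illustration, not FormEncode code: the function names `decode_input` and `encode_output` are hypothetical, and the sketch is written for Python 3, where `str` plays the role of the `unicode` type discussed here and `bytes` the role of an encoded string.

```python
def decode_input(value, encoding="utf-8"):
    """Turn incoming data into text.

    encoding=None means "do not decode": the value is passed through
    unchanged, on the assumption that the outside world already
    supplies text.
    """
    if encoding is None:
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def encode_output(value, encoding="utf-8"):
    """Turn text into outgoing data.

    encoding=None means "do not encode": the text is handed on as is,
    which is what a template engine expecting text would want.
    """
    if encoding is None:
        return value
    return value.encode(encoding)
```

With the defaults, `encode_output("café")` yields UTF-8 bytes as before, while `encode_output("café", None)` returns the text untouched.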

-- Christoph

Ian Bicking

2009-03-12 16:22:28 UTC

Post by Christoph Zwerschke
FormEncode assumes 'utf-8' encoding for the UnicodeString validator.
This is usually ok, and you can even override this using the
inputEncoding and outputEncoding settings. However, what you can *not*
do is have *no* input or output encoding at all, i.e. you can't use
Unicode in the "outside world" as well. But this is necessary, for
instance, if you're using FormEncode with ToscaWidgets forms, because
most templating languages expect Unicode objects instead of encoded
strings. That's why ToscaWidgets comes with its own modified
UnicodeString validator. It would be nice to have this feature in
FormEncode itself.
My suggestion is to allow setting inputEncoding and outputEncoding to
None. Currently, this will use the default encoding utf-8. I suggest not
decoding/encoding at all in this case, i.e. using unicode. If you don't
explicitly specify any inputEncoding or outputEncoding, the default
utf-8 encoding will be used as before. I have already created a patch
for this feature. Can I check this in to the trunk?

Changing the default behavior could mess up working code. I think it
would make sense to treat an encoding of None as do-not-encode (or
decode). A subclass of UnicodeString could have that value
(outputEncoding) default to None.

--
Ian Bicking | http://blog.ianbicking.org

Christoph Zwerschke

2009-03-12 17:07:40 UTC

Post by Ian Bicking
Changing the default behavior could mess up working code. I think it
would make sense to treat an encoding of None as do-not-encode (or
decode). A subclass of UnicodeString could have that value
(outputEncoding) default to None.

Yes, that's exactly what my patch does: It uses NoDefault instead of
None as the default value for inputEncoding and outputEncoding, treats
None as "do not encode/decode" and falls back to the encoding set at the
class level (utf-8 for the standard class) if nothing is specified.

I've checked this in as r3805 now. Maybe we also want to add input and
output encoding default attributes on the class level? There is
currently only one common attribute for both. If we agree on the new
behaviour, I will also update the docstring and unit tests.

-- Christoph

Ian Bicking

2009-03-12 17:11:38 UTC

Post by Christoph Zwerschke
> Changing the default behavior could mess up working code. I think it
> would make sense to treat an encoding of None as do-not-encode (or
> decode). A subclass of UnicodeString could have that value
> (outputEncoding) default to None.
Yes, that's exactly what my patch does: It uses NoDefault instead of
None as the default value for inputEncoding and outputEncoding, treats
None as "do not encode/decode" and falls back to the encoding set at the
class level (utf-8 for the standard class) if nothing is specified.
I've checked this in as r3805 now. Maybe we also want to add input and
output encoding default attributes on the class level? There is
currently only one common attribute for both. If we agree on the new
behaviour, I will also update the docstring and unit tests.

Ah, I see there's just the one encoding. Yeah, let's add specific
ones, with NoDefault settings so that they default to the value of
encoding (in __init__).
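
A rough sketch of that arrangement, for illustration only. It is not the code from r3805 or r3806: the `TextField` class, its snake_case attribute names, and the local `NoDefault` sentinel are stand-ins invented here, and the sketch uses Python 3 `str`/`bytes` in place of `unicode`/`str`.

```python
class NoDefault:
    """Sentinel meaning "nothing was specified" (distinct from None)."""


class TextField:
    # Hypothetical stand-in for a validator like UnicodeString.
    encoding = "utf-8"            # common class-level default
    input_encoding = NoDefault    # specific defaults; NoDefault means
    output_encoding = NoDefault   # "fall back to encoding"

    def __init__(self, input_encoding=NoDefault, output_encoding=NoDefault):
        # An explicit argument wins over the class-level setting.
        if input_encoding is not NoDefault:
            self.input_encoding = input_encoding
        if output_encoding is not NoDefault:
            self.output_encoding = output_encoding
        # Anything still unspecified falls back to the common encoding.
        if self.input_encoding is NoDefault:
            self.input_encoding = self.encoding
        if self.output_encoding is NoDefault:
            self.output_encoding = self.encoding

    def to_python(self, value):
        # None means "do not decode".
        if self.input_encoding is not None and isinstance(value, bytes):
            return value.decode(self.input_encoding)
        return value

    def from_python(self, value):
        # None means "do not encode".
        if self.output_encoding is None:
            return value
        return value.encode(self.output_encoding)


class TemplateTextField(TextField):
    # Subclass whose output stays text, e.g. for template engines.
    output_encoding = None
```

Because `None` is a meaningful setting here, a separate sentinel is needed to tell "not specified" apart from "explicitly no encoding"; that is the role `NoDefault` plays in this sketch.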

--
Ian Bicking | http://blog.ianbicking.org

Christoph Zwerschke

2009-03-12 17:27:44 UTC

Post by Ian Bicking
Ah, I see there's just the one encoding. Yeah, let's add specific
ones, with NoDefault settings so that they default to the value of
encoding (in __init__).

Ok, did so in r3806. Is that what you have in mind?

-- Christoph

Christoph Zwerschke

2009-03-12 20:26:20 UTC

I have now simplified this a bit in r3811 using the fact that
UnicodeString inherits from Declarative; and updated the docstring,
changelog and unit tests.

-- Christoph