[FE-discuss] Unicode characters and htmlfill issue
Andrea Riciputi
2009-03-23 17:43:00 UTC
Hi,
I have a problem with UnicodeString validator and htmlfill. I get a
string from my db and it has some accented characters in it. These
accented chars are represented as Unicode code points as expected.
When I pass this string through the UnicodeString.from_python() method it
is correctly encoded in UTF-8 (again, as expected).

However, when I pass it to htmlfill() to fill in a template I get the
classic UnicodeDecodeError with this traceback:

File '/Users/andrea/Documents/Work/LaMadia/Code/LaMadiaZine/lamadiazine/controllers/customer.py', line 150 in view
  c.customer))
File '/Users/andrea/Library/Python/Virtualenv/pylons-dev/lib/python2.5/site-packages/FormEncode-1.2.1-py2.5.egg/formencode/htmlfill.py', line 78 in render
  p.feed(form)
File '/Users/andrea/Library/Python/Virtualenv/pylons-dev/lib/python2.5/site-packages/FormEncode-1.2.1-py2.5.egg/formencode/rewritingparser.py', line 36 in feed
  HTMLParser.HTMLParser.feed(self, data)
File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 108 in feed
  self.goahead(0)
File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 148 in goahead
  k = self.parse_starttag(i)
File '/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py', line 268 in parse_starttag
  self.handle_starttag(tag, attrs)
File '/Users/andrea/Library/Python/Virtualenv/pylons-dev/lib/python2.5/site-packages/FormEncode-1.2.1-py2.5.egg/formencode/htmlfill.py', line 273 in handle_starttag
  self.handle_input(attrs, startend)
File '/Users/andrea/Library/Python/Virtualenv/pylons-dev/lib/python2.5/site-packages/FormEncode-1.2.1-py2.5.egg/formencode/htmlfill.py', line 361 in handle_input
  self.write_tag('input', attrs, startend)
File '/Users/andrea/Library/Python/Virtualenv/pylons-dev/lib/python2.5/site-packages/FormEncode-1.2.1-py2.5.egg/formencode/rewritingparser.py', line 76 in write_tag
  if not n.startswith('form:')])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

The unicode string (from the db) is: u"\xe0" (small "a" with grave
accent). After calling from_python() on it, I get u"\xc3\xa0" (AFAIK
its UTF-8 counterpart). The traceback seems to suggest that something
inside htmlfill() is unaware of the UTF-8 encoding. But I could be
wrong... Any suggestions?

TIA,
Andrea
Marius Gedminas
2009-03-25 13:37:57 UTC
Hi again!
Post by Andrea Riciputi
I have a problem with UnicodeString validator and htmlfill.
We discussed this on IRC
(http://pylonshq.com/irclogs/%23pylons/%23pylons.2009-03-24.log.html#t2009-03-24T09:04:00
is the horrible url) and decided that the right solution is to use
FormEncode's String validator.

The difference between String and UnicodeString is that String passes
through str and unicode objects unchanged, while UnicodeString always
ensures you get unicode from to_python() and always returns str from
from_python().

Pylons already takes care of str <-> unicode conversions for us, both on
the input side and on the output side, so having FormEncode do it again
causes problems.
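
A minimal sketch of the failure mode, not FormEncode's actual code. It is written so that it also runs on Python 3, where the ASCII decode that Python 2 performs silently when a byte string meets a unicode string has to be spelled out. The helper name implicit_py2_coercion is made up for illustration.

```python
# Text as it comes from the database: a single code point, U+00E0.
text = u"\xe0"

# Roughly what an encoding from_python() step hands back: UTF-8 bytes.
encoded = text.encode("utf-8")


def implicit_py2_coercion(value):
    """Hypothetical stand-in for Python 2 mixing bytes into unicode.

    Python 2 decodes the byte string with the default 'ascii' codec.
    """
    return value.decode("ascii")


try:
    implicit_py2_coercion(encoded)
    error = None
except UnicodeDecodeError as exc:
    # Same kind of error as in the traceback: byte 0xc3 is not ASCII.
    error = str(exc)

# Decoding with the right codec gives the original text back.
roundtrip = encoded.decode("utf-8")

print(repr(encoded))
print(error)
```

The bytes themselves are fine; the trouble only starts when something downstream treats them as ASCII while combining them with unicode.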

Here are my recommendations for avoiding Unicode problems:

* make sure you get unicode objects from the database (i.e. use
  SQLAlchemy's Unicode columns)
* always use unicode objects internally (pure-ASCII str objects don't
  hurt, so pure ASCII string constants in the source code are fine --
  but when you're taking user input, make sure to convert it to
  unicode as soon as possible)
* do not use formencode.validators.UnicodeString
* pass unicode strings to htmlfill.render()

This way SQLAlchemy takes care of conversion at the database side, Pylons
takes care of conversion at the HTML side, and you always deal with
unicode strings.
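
The "decode on the way in, encode on the way out" idea can be sketched roughly like this. The to_text helper is hypothetical and only illustrates the principle; in the setup described above SQLAlchemy and Pylons do these conversions for you, so you would not normally write this by hand.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding byte strings once."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: pass it through untouched, never decode twice.
    return value


# Bytes as they might arrive from outside the application.
raw_input_value = b"caff\xc3\xa8"

# Convert to text as early as possible.
name = to_text(raw_input_value)

# All internal work happens on text.
greeting = u"Ciao, " + name

# Encode only at the very end, when writing the response.
page = greeting.encode("utf-8")

print(repr(page))
```

Because to_text() leaves text alone, calling it on a value that has already been converted is harmless, which is what keeps a second conversion layer from causing trouble.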

Now, I've only been using Pylons for two months so I'm not a great
expert. If there are any holes in this logic, please poke at them!

Marius Gedminas
--
The irony is that Bill Gates claims to be making a stable operating
system and Linus Torvalds claims to be trying to take over the
world.
-- seen on the net
Christoph Zwerschke
2009-03-25 14:09:44 UTC
Post by Marius Gedminas
* do not use formencode.validators.UnicodeString
Or, if you use it, then set outputEncoding=None, as is possible in
the current trunk of FormEncode (which will hopefully be released soon).

-- Christoph
Andrea Riciputi
2009-03-27 11:58:51 UTC
Hi Marius,
thank you very much for your support both on IRC and here on the list.
I've done some experiments with the String() and UnicodeString()
validators and I think I have thoroughly understood your explanation
of what is going on.

However, during these experiments, I've found a little glitch in the
String() validator's behaviour. As you pointed out, String() leaves
strings untouched. In fact, this is not completely true. Just try:

assert String().to_python(u'') is u''
assert String().from_python(u'') is u''

In both cases you get an AssertionError, since both to_python() and
from_python() return '' (and not u'' as expected).

I agree that returning an empty str instead of the empty unicode
string is not a great problem. However, I still think it should be
fixed. Any comments?

Cheers,
Andrea
Marius Gedminas
2009-03-27 23:59:55 UTC
Post by Andrea Riciputi
However, during these experiments, I've found a little glitch in the
String() validator behaviour. As you pointed out String() leaves the
assert String().to_python(u'') is u''
assert String().from_python(u'') is u''
You get in both cases an assertion exception since, both calls to .to/
from_python() returns '' (and not u'' as expected).
I agree that returning an empty basestring instead of the empty unicode
string is not a great problem. However, I still think it should be fixed
anyway. Any comment?
I'll just note in passing that doing identity checks on strings (x is y)
is a bad idea.
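
A small sketch of why, assuming nothing about FormEncode itself: "is" compares object identity, which depends on interpreter details such as interning, while "==" compares contents. If the type matters as well, check it explicitly. The is_empty_text helper is made up for illustration; on Python 3 the text type is simply str.

```python
# unicode on Python 2, str on Python 3.
text_type = type(u"")

# Two equal strings built at runtime, so they need not be one object.
first = "".join(["form", "encode"])
second = "".join(["form", "encode"])

# Compares contents: this is the check you almost always want.
same_value = first == second

# Compares identity: the result is an implementation detail.
same_object = first is second


def is_empty_text(value):
    """Hypothetical check: an empty string of the text type."""
    return isinstance(value, text_type) and value == u""


print(same_value)
print(is_empty_text(u""))
```

A test along the lines of is_empty_text(String().to_python(u'')) would state the intent (empty, and unicode) without depending on whether two empty strings happen to be the same object.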

Does the behaviour that you observed cause problems in practice?

Marius Gedminas
--
1 + 1 = 3
-- from a Microsoft advertisement