Forum OpenACS Q&A: a problem with tcl and ascii > 127

Posted by Jonathan Ellis on
I want to display an umlauted e. I find that chr(235) provides the desired character in postgresql when displayed in a web browser: [db_string foo "select chr(235)"].

But I have this helper proc that I use pervasively:

proc_doc ad_space_to_nbsp { s } {
    replaces all instances of " " in s with "&nbsp;"
} {
    return [string map [list { } {&nbsp;}] $s]
}
ad_space_to_nbsp [db_string foo "select chr(235)"]

corrupts my e into something funky. If I write _to_nbsp as a regsub instead I get the same result.

What's wrong? Is there a quick fix?

Posted by Jonathan Ellis on
I have put a demo script up here.

The script's contents follow:

set s [db_string foo "select chr(235)"]
set s2 [ad_space_to_nbsp $s] ;# shouldn't change $s b/c it contains no spaces

ns_return 200 text/html "
[jbe_header test]

$s

$s2

[string equal $s $s2] [ad_footer] "

"string equal" thinks they are the same when they result in quite different characters to both NN and IE.
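
For anyone who wants to check this outside the server, here is a minimal sketch in plain Tcl (tclsh, no AOLserver or database). The proc name space_to_nbsp and the literal "\u00eb" are stand-ins I made up for ad_space_to_nbsp and the db_string result. It only shows that, at the Tcl string level, the helper leaves a space-free string unchanged, which suggests (but does not prove) that the difference appears when the string is written out as bytes.

```tcl
# Hypothetical plain-Tcl stand-in for ad_space_to_nbsp (no proc_doc required).
proc space_to_nbsp {s} {
    return [string map [list { } {&nbsp;}] $s]
}

# Assumption: "\u00eb" stands in for the value selected from the database.
set s  "\u00eb"
set s2 [space_to_nbsp $s]

# string equal compares characters, not the bytes sent to the browser.
puts "equal as strings: [string equal $s $s2]"

# Encode both explicitly and compare the resulting bytes as hex.
binary scan [encoding convertto utf-8 $s]  H* hex1
binary scan [encoding convertto utf-8 $s2] H* hex2
puts "utf-8 bytes: $hex1 vs $hex2"
```

If both checks agree here, the helper itself is not changing any characters.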
Posted by Simon Buckle on
It looks like a character conversion problem. Tcl 8.1 and later store strings internally as Unicode, specifically UTF-8, although it appears that some commands, such as regsub, convert between UCS-2 and UTF-8.

The UCS-2 value of ë is 0x00EB. The UTF-8 encoding of 0x00EB is 0xC3 0xAB. In ISO-8859-1, 0xC3 is Ã and 0xAB is «, which are the characters that are displayed in the demo script you linked to.
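
The byte arithmetic above can be reproduced with a small sketch in plain Tcl (tclsh only, nothing AOLserver-specific). The variable names are just illustrative. It encodes ë as UTF-8 and then deliberately misreads those bytes as ISO-8859-1, which is what a browser does when it is not told the charset.

```tcl
# One character: U+00EB (e with diaeresis).
set s "\u00eb"

# Encode to UTF-8; the result is a two-byte binary string.
set utf8 [encoding convertto utf-8 $s]
binary scan $utf8 H* hex
puts "UTF-8 bytes: $hex"

# Misinterpret those two bytes as ISO-8859-1: each byte becomes one character.
set mojibake [encoding convertfrom iso8859-1 $utf8]
set cps {}
foreach c [split $mojibake ""] {
    lappend cps [format %04X [scan $c %c]]
}
puts "Read as ISO-8859-1: [string length $mojibake] characters, code points $cps"

# Decoding with the right encoding recovers the original character.
set roundtrip [encoding convertfrom utf-8 $utf8]
puts "Round trip ok: [string equal $s $roundtrip]"
```

The two code points reported for the misread string should be the Ã and « described above.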

If you are sending out the raw bytes, try setting the character set in the Content-Type header, e.g.:

ns_return 200 "text/html; charset=utf-8" ...


Have a look at this document for more information.
Posted by Jonathan Ellis on
Thanks, that helps.  Specifying charset=utf-8 does work.

It's a less than ideal solution though for my purposes for two reasons.  First, it behaves differently under rl_returnz: there, $s2 displays correctly, but $s is something funky.  (Both display the umlauted e with vanilla ns_return.)  There are many places where I'm just pulling strings out of the database and writing them out so funkifying those is a poor option. :)  And I'd prefer to avoid experimenting with recreating my database with a different encoding.

Second, I'd rather not have to go change all my text/html to "text/html; charset=utf-8", and I'd also rather not write a wrapper that ignores the content-type.

So although I admit it's ugly as hell I'd rather modify my ad_space_to_nbsp proc to keep the same encoding.  There may be one or two other procs that use regsub et al that I'd need to modify but not very many.  Is there a way to do this?  I tried

[encoding convertto iso8859-1 [ad_space_to_nbsp $s]]

but that didn't help.  Probably not surprising to you. :)

Posted by Jeff Davis on
Which version of rl_returnz are you using?  I have a version
modified to be encoding aware which might fix this; the
vanilla version does not bother to change the internal encoding to whatever external encoding you have set.
Posted by Jamie Rasmussen on
Jeff, do you know if your encoding-aware changes made it into the SourceForge copy of rl_returnz (renamed nsreturnz)?
Posted by Jonathan Ellis on
I'm using a version I got from Jerry Asher a while ago, patched with my fix for the IE 5/6 refresh bug, and modified slightly for nsd 4.  I don't see any code for handling various encodings in it.
Posted by Jeff Davis on
It's not at sourceforge; it is available at http://xarg.net/code/ (although I never tried to make it
work with aolserver4).

Jonathan, can you send me a diff of the changes you needed to
make for aolserver4?

Posted by Jonathan Ellis on
I'm pretty sure all I had to do was change Ns_ConnReturnData to Ns_ConnReturnRawData.