Forum OpenACS Q&A: Getting started: adding fields to registration screen

I'm completely new to ACS. I'm going to gradually build a site with a lot of custom functionality, but right now I want to publish the first version and am facing two simple tasks. Please bear with me: I've been reading the docs for quite a while and am still puzzled.
  • I need to add three mandatory fields to the user information: two text fields (first & family name in Russian spelling) and an item picked from a list (user location). The items should be asked for at registration and should be editable in the user's workspace.

    How do I add them? Which files do I edit, and how do I add the columns to the users table?

  • I need to change the default encoding of all HTML pages to Windows-1251.

    The string text/html; charset=windows-1251 needs to be listed both in the server response and in the HEAD section of the HTML source of each and every page; what should I modify to achieve this?
    Are there any problems with PostgreSQL if I feed it characters in 128-255 code range (in fields, not in column names)?
Is it an OK workflow to modify the files on the development server, debug it there and then copy everything to the production one? If I'm developing alone, can I get away with not using CVS for some time?

If I modify the files that make log in, workspace, etc. in the core, does that mean I wouldn't be able to upgrade to a new release, or there's a way around it?

I need to add three mandatory fields to the user information

You can create a separate table for this and add the SQL for the table creation wherever it makes the most sense for you. Key that table on object_id. Or you could modify the default object creation code and add these as attributes of the user object. If you don't think you will ever have to add other attributes to the object on a production server after you deploy, that would be the way to go.
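A minimal sketch of the separate-table approach, assuming PostgreSQL and the OpenACS db API; the table and column names here are illustrative, not part of OpenACS:

```tcl
# Illustrative DDL for a side table keyed on the user's object_id
# (put it wherever your package's create scripts live, e.g. a *-create.sql file):
#
#   create table users_extra (
#       user_id        integer
#                      constraint users_extra_pk primary key
#                      constraint users_extra_fk references users,
#       first_name_ru  varchar(100) not null,
#       family_name_ru varchar(100) not null,
#       location       varchar(100) not null
#   );
#
# In the registration/workspace pages you would then read and write the row
# with the db API. ASSUMPTION: ad_get_user_id returns the logged-in user's id
# in your ACS version; in some versions it is [ad_conn user_id] instead.
set user_id [ad_get_user_id]
db_dml update_extra {
    update users_extra
    set    first_name_ru  = :first_name_ru,
           family_name_ru = :family_name_ru,
           location       = :location
    where  user_id        = :user_id
}
```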

text/html; charset=windows-1251 needs to be listed both in the server response and in the HEAD section

For the head section, modify the default-master template. I'm not sure about the server response.
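A sketch covering both places, using the OutputCharset, HackContentType, and ns/mimetypes parameters described in the AOLserver charset document quoted further down this thread; whether windows-1251 is available depends on your AOLserver build's charset table:

```tcl
# nsd.tcl fragment: send every text response labeled as windows-1251.
ns_section ns/parameters
ns_param OutputCharset   windows-1251   ;# translate Tcl output to this charset
ns_param HackContentType true           ;# append charset= to the Content-Type header

ns_section ns/mimetypes
ns_param .html "text/html; charset=windows-1251"
ns_param .txt  "text/plain; charset=windows-1251"

# For the HEAD section, add a line like this to the default-master template
# (path varies by install, e.g. www/default-master.adp):
#   <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
```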

Are there any problems with PostgreSQL if I feed it characters in 128-255 code range

Make sure your postgres installation was created with the right charset. I don't know them offhand. Or are they all 8 bit nowadays?

Is it an OK workflow to modify the files on the development server, debug it there and then copy everything to the production one? If I'm developing alone, can I get away with not using CVS for some time?

I recommend going with CVS right from the beginning. You can tag your files when you push to production so if you find something that worked on development but is hosing production you can roll back.

If I modify the files that make log in, workspace, etc. in the core

You always run that risk. Read these:

  • https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=00036T&topic_id=11&topic=OpenACS
  • http://www.cvshome.org/docs/manual/cvs_13.html#SEC104
  • https://openacs.org/new-file-storage/download/cvs.html?version_id=140
  • http://www.piskorski.com/cvs-conventions.html
for people's musings on this.
I need to change the default encoding of all HTML pages to Windows-1251

See also encodings.html, which comes stock with the AOLserver distro, normally in the root dir. Not sure if it would fit below, but heck, worth a try...

Character Encoding in AOLserver 3.0 and ACS

by Rob Mayoff


This is a work in progress. It is incomplete, inconsistent, and subject to radical change.

Note that this document applies only to the Tcl 8 version of AOLserver, also known as nsd8x, because Tcl 7 has no internationalization support. This document is also mainly concerned with the AOLserver Tcl API, because that is what we use at ArsDigita. There are probably problems in the C API as well that are not covered here.

Parts of this document where I am not sure of something and am specifically seeking advice have a blue background like this. If you have any feedback please e-mail me.


The Problem

Here's a simple example of the problem: you have a file on disk, named "hello.html" and stored using the ISO-8859-1 encoding:

Hello. My name is Günther Schmidt.

(That should say "Gunther" with an umlaut on the "u".) Since it's in ISO-8859-1 encoding, the u with umlaut is stored as one byte with value 0xFC. Suppose you send this file to the user using this script:

set fd [open /web/pages/hello.html r]
set content [read $fd [file size /web/pages/hello.html]]
close $fd
ns_return 200 text/html $content

Then the user will probably see this:

Hello. My name is Günther Schmidt.

(That should say "Gunther" with an umlaut on the "u".) But suppose you send this file using this script:

set fd [open /web/pages/hello.html r]
set content [read $fd [file size /web/pages/hello.html]]
close $fd
regsub {Hello.} $content {Hello!} content
ns_return 200 text/html $content

Then the user will probably see this:

Hello! My name is GÃ¼nther Schmidt.

(That should say "GÃ¼nther", with a tilde on the "A" and the "¼" as a fraction.) What happened? The reason it worked in the first case is that by default, AOLserver just ships out the raw bytes from the (ISO-8859-1-encoded) file, and the HTTP standard says that the client must assume a charset of ISO-8859-1 if no other charset is specified. The file encoding and the browser encoding matched, and AOLserver sent the data unmodified, so everything worked.

The second case is different. It turns out that Tcl 8.1 and later use Unicode. The interpreter normally stores strings using the UTF-8 encoding (which uses a variable number of bytes per character), and sometimes converts them to UCS-2 encoding (which uses 16-bit "wide characters"). The regsub command is one of those cases where conversions are involved. First, regsub converted the string to UCS-2. Tcl's UTF-8 parser is lenient, so the transformation ended up translating the byte 0xFC into the character U+00FC. (This happens to be the correct translation, because UCS-2 is a superset of ISO-8859-1.) Then regsub did its matching and substitution. Then it converted the UCS-2 representation back to UTF-8. The UTF-8 encoding of U+00FC is 0xC3 0xBC. AOLserver does not know anything about UTF-8; it just sends whatever bytes you give it. In ISO-8859-1, 0xC3 means Ã and 0xBC means ¼.

So regsub didn't do anything wrong. We gave it garbage (a non-UTF-8 string), so it gave us garbage. How do we solve this problem? We need to make sure that all of AOLserver's textual input is translated to its UTF-8 representation and that the UTF-8 is translated to the appropriate character encoding on output.
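One fix for the broken example above, sketched under the assumption that the file really is ISO-8859-1: tell Tcl the file's encoding when reading, so the string is valid UTF-8 internally, and declare a charset when sending so AOLserver translates it back on output:

```tcl
set fd [open /web/pages/hello.html r]
fconfigure $fd -encoding iso8859-1           ;# translate ISO-8859-1 bytes to Tcl's internal UTF-8
set content [read $fd]
close $fd
regsub {Hello\.} $content {Hello!} content   ;# now safe: regsub operates on a valid Tcl string
ns_return 200 "text/html; charset=iso-8859-1" $content
```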

Terminology

A character encoding is a mapping from a set of characters to a set of octet sequences. US-ASCII maps all of its characters to a single octet each. UTF-8 maps its characters to a variable number of octets each.

Charset is synonymous with "character encoding"; Internet standards use this term.

Tcl 8.1 and later use Unicode and UTF-8 internally and include support for converting between character encodings. The Tcl names for various encodings are different than the Internet standard names. So, in this document, I typically use the term "encoding" when I am referring to Tcl, and "charset" when I am referring to an Internet protocol feature.
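The naming difference can be seen in plain Tcl (8.1+, no AOLserver needed); Tcl calls the Latin-1 encoding "iso8859-1", while the Internet charset name is "iso-8859-1":

```tcl
# Convert between Tcl's internal UTF-8 strings and an external encoding.
set s "G\u00FCnther"                          ;# stored internally as UTF-8
set bytes [encoding convertto iso8859-1 $s]   ;# one byte per character; 0xFC for the u-umlaut
set back  [encoding convertfrom iso8859-1 $bytes]
# "encoding names" lists every encoding this Tcl build can convert.
```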

Database Access

For database access, the only sane choice is to use a database that supports UTF-8. Then Tcl strings can be passed to and from the database client library unmodified. Trust me, you just want to use a UTF-8 database.

Configuration Files

AOLserver reads its configuration files (both Tcl and ini-format) with no character encoding translation. This means that you must store AOLserver configuration files in UTF-8.

Tcl Source Files

AOLserver supports Tcl source files in your Tcl library and under your PageRoot. In either case, it reads the files using the Tcl "source" command, which uses the Tcl "system encoding" when it reads the files. In AOLserver, the system encoding is UTF-8. Therefore you must store your Tcl source in UTF-8 format. The simplest strategy (if you do not have a UTF-8 editor) is to use only US-ASCII bytes, and represent any other characters using the \xXX notation (for any ISO-8859-1 character) or the \uXXXX notation (for any Unicode character).
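For example, a source file kept in pure US-ASCII can still produce non-ASCII text via these escapes:

```tcl
# The source file contains only US-ASCII bytes; the escapes produce the non-ASCII characters.
set name1 "G\xFCnther"     ;# \xXX escape: any ISO-8859-1 character
set name2 "G\u00FCnther"   ;# \uXXXX escape: any Unicode character
# Both are the same 7-character string once Tcl parses the escapes.
```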

Content Files

By "content file", I mean a file containing data to be sent to the client, not a file containing a program to be run. So an HTML or JPEG file is a content file, but a Tcl script is not.

AOLserver has several APIs for sending the contents of a file directly to the client. All of them send the contents of the file back to the client unmodified - no character encoding translation is performed. This means that it is up to you to ensure that the file's encoding is the same as the encoding the client expects.

The safest thing is to use only US-ASCII bytes in your text files - bytes with the high bit clear. Just about every character encoding you're likely to run across on the Web will be a superset of US-ASCII, so no matter what charset the client is expecting, your content will probably be displayed correctly. If you are sending an HTML (or XML) file, it can still access any Unicode character using the &#nnn; notation. However, if you have non-HTML files, or you don't want to deal with all those character reference entities, you'll have to make sure your client knows what character set you're sending.

The client knows what character set to expect from the Content-Type header. You're probably used to seeing a header like this:

Content-Type: text/html
In fact, the header can specify a charset like this:
Content-Type: text/html; charset=iso-8859-1
If the header does not include a charset parameter, the HTTP standard specifies that the character set must be ISO-8859-1. In practice, though, clients may try to guess a character set, or they may let the user override the default character set. So you should always specify a character set for text content.

Typically, you determine the content-type to send for a file by calling ns_guesstype on it. ns_guesstype looks up the file extension in AOLserver's file extension table to pick the content-type. The default table is in the AOLserver manual. Some of the default mappings are:

Extension   Type
.html       text/html
.txt        text/plain
.jpg        image/jpeg
As you can see, no charset is specified for the text file types. That means that you can't predict how your text will appear on the user's screen unless you stick to US-ASCII bytes in your files. So you should override the mappings in your AOLserver config file. For example, if all your text files use the ISO-8859-1 encoding, you should put this in your config file:
nsd.ini:

[ns/mimetypes]
.html=text/html; charset=iso-8859-1
.txt=text/plain; charset=iso-8859-1

nsd.tcl:

ns_section ns/mimetypes
ns_param .html "text/html; charset=iso-8859-1"
ns_param .txt "text/plain; charset=iso-8859-1"
But all your text files might not use the same encoding. If you have files in various encodings, you need to make up extensions to identify the different encodings, rename your files accordingly, and map the extensions in your config file. For example:
nsd.ini:

[ns/mimetypes]
.html=text/html; charset=iso-8859-1
.txt=text/plain; charset=iso-8859-1
.html_sj=text/html; charset=shift_jis
.txt_sj=text/plain; charset=shift_jis
.html_ej=text/html; charset=euc-jp
.txt_ej=text/plain; charset=euc-jp

nsd.tcl:

ns_section ns/mimetypes
ns_param .html "text/html; charset=iso-8859-1"
ns_param .txt "text/plain; charset=iso-8859-1"
ns_param .html_sj "text/html; charset=shift_jis"
ns_param .txt_sj "text/plain; charset=shift_jis"
ns_param .html_ej "text/html; charset=euc-jp"
ns_param .txt_ej "text/plain; charset=euc-jp"
If you wish to translate the contents of a file to some other charset when you send it, you can use Tcl's file handling:
set fd [open somefile.html_sj r]
fconfigure $fd -encoding shiftjis
set html [read $fd [file size somefile.html_sj]]
close $fd
ns_return 200 "text/html; charset=euc-jp" $html

XXX ACS: ad_serve_html_file

Output from Tcl

Your Tcl programs (Tcl files, filters, and registered procs) can send content to the client using a number of commands:
  • ns_writefp
  • ns_connsendfp
  • ns_returnfp
  • ns_respond
  • ns_returnfile
  • ns_return (and variants like ns_returnerror)
  • ns_write
The commands for sending files are discussed under Content Files.

Tcl stores strings in memory using UTF-8. However, when you send content to the client from Tcl, you may not want the client to receive UTF-8; he may not support it. So AOLserver can translate UTF-8 to a different charset.

If you use ns_return or ns_respond to send a Tcl string to the client, AOLserver determines what character set to use by examining the content type you specify:

  1. If your content-type includes a charset parameter, then AOLserver translates the string to that charset.
  2. Otherwise, if your content-type is text/anything, then AOLserver translates the string to the charset specified in the config file by ns/parameters/OutputCharset (iso-8859-1 by default).
  3. Otherwise, AOLserver sends the string unmodified.

In the second instance, where AOLserver uses the ns/parameters/OutputCharset, if ns/parameters/HackContentType is also set to true, then AOLserver will modify the Content-Type header to include the charset parameter. HackContentType is set by default, and I strongly recommend leaving it set, because it's always safer to tell the client explicitly what charset you are sending.

For example, the default configuration is equivalent to this:

[ns/parameters]
OutputCharset=iso-8859-1
HackContentType=true
So if you run this command:
ns_return 200 text/html $html
This header will be sent:
Content-Type: text/html; charset=iso-8859-1
And the contents of $html will be converted to the ISO-8859-1 encoding as they are sent to the client.

If you write the headers to the client with ns_write instead of letting AOLserver do it (via ns_return or ns_respond), then AOLserver does not parse the content-type. You must explicitly tell it what charset to use immediately after you write the headers, by calling ns_startcontent in one of these forms:

ns_startcontent
Tells AOLserver that you have written the headers and do not wish the content to be translated.
ns_startcontent -charset charset
Tells AOLserver that you have written the headers and wish the following content to be translated to the specified charset.
ns_startcontent -type content-type
Tells AOLserver that you have written the headers and wish the following content to be translated to the charset specified by content-type, which should be the same value you sent to the client in the Content-Type header. If content-type does not contain a charset parameter, AOLserver translates to ISO-8859-1.
The client may specify which charsets it accepts by sending an Accept-Charset header in its HTTP request. If the Accept-Charset header is missing, the client is assumed to accept any charset. The ns_choosecharset command returns the best charset to use, taking into account the Accept-Charset header and the charsets supported by AOLserver. The syntax is
ns_choosecharset ?-preference charset-list?

The ns_choosecharset algorithm:

  1. Set preferred-charsets to the list of charsets specified by the -preference flag. If that flag was not given, use the config parameter ns/parameters/PreferredCharsets. If the config parameter is missing, use {utf-8 iso-8859-1}. The list order is significant.
  2. Set acceptable-charsets to the intersection of the Accept-Charset charsets and the charsets supported by AOLserver.
  3. If acceptable-charsets is empty, return the charset specified by config parameter ns/parameters/DefaultCharset, or iso-8859-1 by default.
  4. Choose the first charset from preferred-charsets that also appears in acceptable-charsets. Return that charset.
  5. If no charset in preferred-charsets also appears in acceptable-charsets, then choose the first charset listed in Accept-Charset that also appears in acceptable-charsets. Return that charset.
  6. (Note: the last step will always return a charset, because acceptable-charsets can only contain charsets listed in Accept-Charset.)

Example:

# Assume japanesetext.html_sj is stored in Shift-JIS encoding.
set fd [open japanesetext.html_sj r]
fconfigure $fd -encoding shiftjis
set html [read $fd [file size japanesetext.html_sj]]
close $fd

set charset [ns_choosecharset -preference {utf-8 shift-jis euc-jp iso-2022-jp}]
set type "text/html; charset=$charset"
ns_write "HTTP/1.0 200 OK
Content-Type: $type

"
ns_startcontent -type $type
ns_write $html

URL Encoding

Whether a URL is made up of "characters" or "bytes" is a complex issue (see RFC 2396 for details). Ultimately, though, URIs are transmitted over the network, so they must be reduced to bytes. However, HTTP limits the set of bytes used to transmit a URL. URLs containing bytes outside that set must be encoded for transmission.

In URL encoding, one byte may be encoded as three bytes which in US-ASCII represent a percent character ("%") followed by two hexadecimal digits.

After a URL is decoded, any byte less than 0x80 represents a US-ASCII character. The problem with URLs and URL encoding is that, historically, no standard defined what bytes 0x80 and above represent. Various proposals, such as the IURI Internet-Draft, propose using UTF-8 exclusively as the character encoding in URLs, but existing software does not work that way.

AOLserver's ns_urlencode and ns_urldecode choose the character encoding to use in one of three ways:

  1. If the command was invoked with a -charset flag, use that charset. For example:
    ns_urlencode -charset shift_jis "\u304b"
    Unicode character U+304B is HIRAGANA LETTER KA. In Shift-JIS this is encoded as 0x82 0xA9, so the command returns the string "%82%A9".
  2. If no -charset flag was given, then the ns_urlcharset command determines what encoding is used. The ns_urlcharset sets the default charset for the ns_urlencode and ns_urldecode commands for one connection. For example, these commands have the same result as the preceding example:
    ns_urlcharset shiftjis
    ns_urlencode "\u304b"
    The ns_urlcharset command is only valid when called from a connection thread. Do not call it from an ns_schedule_proc thread.
  3. If neither of the preceding steps specified a charset, then the AOLserver config parameter ns/parameters/URLCharset determines the charset. The default value for the parameter is "iso-8859-1".
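For instance, a round trip with an explicit charset, per rule 1 above (a sketch; these commands exist only inside AOLserver's Tcl interpreter):

```tcl
# Encode and decode with the same explicit charset.
set enc [ns_urlencode -charset iso-8859-1 "G\xFCnther"]  ;# the 0xFC byte becomes a %-escape
set dec [ns_urldecode -charset iso-8859-1 $enc]          ;# restores the original string
# Using different charsets for the two calls is exactly the mismatch to avoid.
```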

A URL, as seen by AOLserver in an HTTP request, consists of two parts, the path and the query. For example:

/register/user-new.tcl              (the path)
?first_names=Rob&last_name=Mayoff   (the query)

We will consider the path part and the query part separately.

URL Path

AOLserver decodes the path part of the URL in the HTTP request before determining how to handle the URL. It does not run any Tcl code in the connection thread first, so AOLserver always uses the charset specified by ns/parameters/URLCharset to decode the path. You must use the same charset to encode URLs you send out, or you will have problems.

However, other people might link to you from their servers and might be careless about the character encodings. So the safest practice is to use only US-ASCII characters in your URL paths if you possibly can.

Form Data in application/x-www-form-urlencoded Format

Form data comes from one of two places:
  • In an HTTP GET request, the query data is the part of the request URL following the first 0x3F byte (the first question mark). This data is always in application/x-www-form-urlencoded format.
    Okay, it could be raw data from an <ISINDEX> page, but that tag is deprecated in HTML 4.0. Let's simplify our lives by pretending it doesn't exist.
  • In an HTTP POST request, the query data is the request contents, following the request header. By default this data is in application/x-www-form-urlencoded format. The other format is covered under POST Data in multipart/form-data Format.
Either way, form data is URL-encoded for transmission. AOLserver has no standard way to determine what charset the browser used to encode the data. Typically the browser uses whatever charset the HTML page containing the form was in. If the HTML page was sent without a charset in the Content-Type header, then the browser is supposed to use ISO-8859-1, but browsers often guess or let the user override that default. Always specify a charset when you send a text document to avoid this problem.

If you always send data in a single charset, and you always specify the charset in the Content-Type header, then it is safe to assume that form data is always encoded using that charset. Just make that your ns/parameters/URLCharset and don't worry about it.
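For example, if every text page you serve is labeled windows-1251 (as in the question at the top of this thread), the matching config would look like this sketch (parameter names per this document; charset availability depends on your AOLserver build):

```tcl
ns_section ns/parameters
ns_param OutputCharset windows-1251   ;# the single charset you always send...
ns_param URLCharset    windows-1251   ;# ...and therefore assume on incoming form data
```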

If you cannot limit yourself to a single charset, then you need to use some other technique. No matter how you do it, you must call ns_urlcharset before calling ns_conn form or ns_getform. If you call ns_urlcharset after you've asked AOLserver for the form, it will not work retroactively.

Here are two ways you could determine the charset:

  • Include a hidden field in all your forms, to indicate the charset. Example:
    # myform.tcl
    set _charset [ns_choosecharset]
    ns_return 200 "text/html; charset=$_charset" "
    <form action='myform-2.tcl'>
    <input type='hidden' name='_charset' value='$_charset'>
    First Names: <input type='text' name='first_names'><br>
    Last Name: <input type='text' name='last_name'><br>
    <input type='submit' name='submit' value='Submit'>
    </form>
    "
    The chicken-and-egg problem here is that you need the contents of a form field in order to decode the form. Fortunately, all charset names use only US-ASCII characters, so you can extract the _charset field from the query string without decoding it. The predefined command ns_formfieldcharset will do this for you:
    # myform-2.tcl
    ns_formfieldcharset _charset
    set form [ns_conn form]
    set first_names [ns_set get $form first_names]
    set last_name [ns_set get $form last_name]
    # etc.
    ns_formfieldcharset calls ns_urlcharset, so this will affect all further use of ns_urlencode and ns_urldecode for that connection, unless you call ns_urlcharset again.

  • Use a cookie to store the last charset you sent to the user. Example:
    # anotherform.tcl
    set _charset [ns_choosecharset]
    ns_set put [ns_conn outputheaders] Set-Cookie _charset=$_charset
    ns_return 200 "text/html; charset=$_charset" "
    <form action='anotherform-2.tcl'>
    First Names: <input type='text' name='first_names'><br>
    Last Name: <input type='text' name='last_name'><br>
    <input type='submit' name='submit' value='Submit'>
    </form>
    "
    There is no chicken-and-egg problem here, but AOLserver still provides the predefined command ns_cookiecharset to set the URL encoding from a cookie:
    # anotherform-2.tcl
    ns_cookiecharset _charset
    set form [ns_conn form]
    set first_names [ns_set get $form first_names]
    set last_name [ns_set get $form last_name]
    # etc.
    Using a cookie has the big drawback that a cookie is not associated with a single web page. So if the user uses his back button, or has a page cached, or has multiple windows open, the wrong cookie value might be sent back to us.

Form Data in multipart/form-data Format

The browser sends data in multipart/form-data format when the FORM tag says enctype='multipart/form-data'. This format is based on the MIME standard and allows file upload (which application/x-www-form-urlencoded does not).

Alas, multipart/form-data format is no better than application/x-www-form-urlencoded format as far as character encoding issues are concerned. The MIME multipart format allows each form field to include its own Content-Type header with a charset parameter, but in practice clients do not send any indication of the charset used. So we must resort to the same tricks to decide what charset the data is in: always use the same charset, or use a hidden field or a cookie to determine the charset.

The ns_formfieldcharset and ns_cookiecharset commands work for fields in multipart/form-data format except file upload fields. We cannot know what character set the user stores his files in, so we don't know how to translate an uploaded file to utf-8 (assuming the uploaded file is even a text file). So the temporary files created by ns_getform contain the exact bytes sent by the client.

If you hand non-UTF-8 data to the Oracle client library when it thinks you are handing it UTF-8 data, it may crash. So when you are inserting an uploaded file into a CLOB, it is imperative that you run the file contents through Tcl's encoder first. I have not figured out a satisfactory way to automate this yet.
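One manual approach (the author notes this is not yet automated): if you know, or are willing to assume, the upload's charset, re-read the temporary file with that encoding so the resulting Tcl string is valid UTF-8 before it goes anywhere near the Oracle client library. The form-field key and the iso8859-1 assumption below are illustrative:

```tcl
# tmpfile is the temporary file created by ns_getform for the upload field;
# ASSUMPTION: the "<field>.tmpfile" key convention and the file's charset.
set form    [ns_getform]
set tmpfile [ns_set get $form upload_file.tmpfile]
set fd [open $tmpfile r]
fconfigure $fd -encoding iso8859-1   ;# ASSUMPTION: the user's file is ISO-8859-1
set contents [read $fd]
close $fd
# $contents is now a valid Tcl (UTF-8) string, safe to bind into a CLOB insert.
```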

Cookies

The browser should not mess with cookie values; it should just send back exactly the bytes you sent it. However, it is common to URL-encode cookie values that might otherwise have unsafe characters in them. You need to be careful to use the same character encoding for encoding and decoding cookie values.

ns_httpopen / ns_httpget

The ns_httpopen command now parses the Content-Type header from the remote server and sets the encoding on the read file descriptor appropriately. If the content from the remote server is a text type but no charset was specified, then ns_httpopen uses the config parameter ns/parameters/HttpOpenCharset, which specifies the charset to assume the remote server is sending (iso-8859-1 by default).
