Forum OpenACS Q&A: Response to Dealing with non-Roman character sets

Posted by Neophytos Demetriou on
I have been using openacs-4 with aolserver-3.3.1+ad13 (which AFAIK has internationalization support based on Mayoff and Minsky's patches) for about six months now and it works great. In my case AOLserver is configured to convert content into UTF-8 as early as possible (when data arrives from the client) and to keep it in UTF-8 in the database for as long as possible (until data is sent back to the client). Keeping the data in UTF-8 on the server side and sending it back to the user is the easy part.

The difficult part, however, is that you need to know the encoding of text submitted through a form (as Henry has already pointed out above). If you were using an ISO-8859 character set for sending content to the client, I would not expect any serious problems (at least this is the case for ISO-8859-7, i.e. Greek). This is because the user is most likely to submit her text in the same ISO-8859 character set that was used when the page was sent to her browser. In that case AOLserver will convert the data back to UTF-8 as *required*.

In your case though, you have to send the page to the client in UTF-8, since you want truly multilingual documents (instead of bilingual ones, which is what you get with ISO-8859 character sets). Let me just say that I'm not an expert on this stuff and I would appreciate it if Henry or anybody else could verify this information.
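
To make the form-encoding issue concrete, here is a minimal sketch in Python (not AOLserver/Tcl code, and not how the Mayoff and Minsky patches are implemented). The helper name `to_utf8` is hypothetical. It assumes the browser submits form data in the charset the page was served in, and shows what happens when the server guesses that charset wrongly.

```python
def to_utf8(raw, page_charset):
    """Decode submitted form bytes using the charset the page was
    served in, then re-encode them as UTF-8 for storage."""
    return raw.decode(page_charset).encode("utf-8")


greek = "Ελληνικά"

# Bytes a browser would typically send back for a page served as ISO-8859-7.
submitted = greek.encode("iso-8859-7")

# Correct assumption about the charset: the text round-trips intact.
stored = to_utf8(submitted, "iso-8859-7")
assert stored.decode("utf-8") == greek

# Wrong assumption (ISO-8859-1): the conversion raises no error,
# but the stored text is no longer the Greek the user typed.
garbled = to_utf8(submitted, "iso-8859-1").decode("utf-8")
assert garbled != greek
```

The wrong guess fails silently, which is why the server has to know (or control) the encoding of submitted text rather than detect it after the fact.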

Also, keep in mind that openfts-tcl *cannot* be used with UTF-8 documents as is. I got it working with UTF-8 documents by modifying the parser and by using only the default dictionary (UnknownDict -- no stemming, no stopwords, exact matching). Making openfts handle UTF-8 documents with more than one dictionary, e.g. Porter's algorithm for English and a morphology-based dictionary for some other language, is a more complicated process, and even though I have an idea of how it can be done, I have not had the chance to try it yet.
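
As a rough illustration of what "no stemming, no stopwords, exact matching" means for UTF-8 text, here is a small Python sketch. It is not openfts code and does not reflect the actual parser changes; `index_terms` and `matches` are hypothetical names, and splitting on Unicode word characters is an assumption made for the example.

```python
import re

# For str patterns in Python 3, \w matches Unicode word characters,
# so Greek (or any other script's) letters are kept inside tokens.
WORD = re.compile(r"\w+")


def index_terms(utf8_doc):
    """Decode a UTF-8 document and split it into terms, unchanged:
    no stemming, no stopword removal, no case folding."""
    return WORD.findall(utf8_doc.decode("utf-8"))


def matches(query, utf8_doc):
    """Exact matching: the query must equal one of the indexed terms."""
    return query in index_terms(utf8_doc)


doc = "Καλημέρα world".encode("utf-8")
assert index_terms(doc) == ["Καλημέρα", "world"]
assert matches("world", doc)
assert not matches("worlds", doc)  # no stemming, so no match
```

Per-language dictionaries would additionally need to decide which language each term belongs to before stemming it, which is the harder part mentioned above.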