Forum OpenACS Q&A: Globalization

Posted by Yonatan Feldman on

One really important point for us in Europe is support for multiple languages. Please note that most (if not all) major commercial learning platforms made the mistake of not being localizable from the start, and are paying for this now (figuratively and literally). I mention this because I doubt there are many foreign-language courses at Sloan or Berklee, and internationalization is easy to forget about on a tight schedule. Retrofitting it is a lot harder than designing for it from the start, not least because adding multi-language support at a later stage requires more coordination and communication.

This is important.

Finding out that OpenACS is the platform Greenpeace has chosen eases my multilingual concerns on the OpenACS side (although it does not eliminate them). I just hope that the multilingual features of OpenACS and dotLRN will "dovetail", as Don so elegantly put it. Obviously, the best solution would be one that is independent of dotLRN and built into OpenACS, making multilingual support a snap.

Can somebody who knows both systems and has experience in internationalization please comment on this? Who is working on this?

-- Carl Robert Blesius, May 9, 2002

2: Response to Globalization (response to 1)
Posted by Jon Griffin on
acs-reference was made with globalization in mind. It isn't the solution, only the framework. I know Henry Minsky wrote an acs-lang package, but that covers only one aspect and should probably be updated to use acs-reference et al.
3: Response to Globalization (response to 1)
Posted by Yonatan Feldman on
you are right, including globalization in the design of an application, or even more of a web-application infrastructure, makes things much simpler and cleaner. since OACS has already been designed (cough, cough), and mostly implemented, we'll just have to suffer through the pain of building it in after the fact.

i built the globalization infrastructure for ACS Java, so i have some experience with this stuff. i think we should use this thread to design and plan (logistically) the retrofit of globalization into OACS.

i am not familiar with the globalization infrastructure in OACS currently, i know of acs-reference, but not of its capabilities. why don't we start by cataloging everything that is in OACS currently and how we can use it.

4: Response to Globalization (response to 1)
Posted by Yonatan Feldman on
i see that jon preempted me :)
5: Response to Globalization (response to 1)
Posted by Don Baccus on
Let me explain briefly what's happening at Greenpeace ...

Greenpeace is making heavy use of a derivative of acs-lang.  Bruno Mattarollo (currently on two days' holiday so he probably won't pop up here after all) has modified it to key messages to package and locale, rather than to language.  Keying to package allows for the generation of translator pages on a per-package basis (rather than burying them in a system-wide sea of messages with no organization).  Keying to package/page might be better in general but the current approach suits GP's needs just fine.  Keying to locale rather than language allows for the use of different dialects (American English, etc).

Also "translation missing" messages are turned into a hyperlink to an "enter translation" page if the user has privs.  I think this was part of the original acs-lang, but it's possible Bruno added it.
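A minimal sketch of the package-and-locale keying described above, written in Python rather than Tcl for illustration. The in-memory `MESSAGES` table, the `lookup` and `keys_for_package` names, and the fallback to a default locale are all assumptions; the actual acs-lang derivative stores its messages in the database and may behave differently.

```python
# Hypothetical in-memory catalog keyed by (package, locale, message key).
MESSAGES = {
    ("news", "en_US", "title"): "Latest news",
    ("news", "de_DE", "title"): "Aktuelle Nachrichten",
    ("news", "en_US", "more"): "Read more",
}

DEFAULT_LOCALE = "en_US"  # assumption: fall back to a site-wide default locale


def lookup(package, locale, key):
    """Return the message for (package, locale, key), trying the default locale next."""
    for candidate in (locale, DEFAULT_LOCALE):
        message = MESSAGES.get((package, candidate, key))
        if message is not None:
            return message
    # Stand-in for the "translation missing" marker mentioned in the thread.
    return f"TRANSLATION MISSING: {package}.{key} ({locale})"


def keys_for_package(package):
    """List the distinct keys of one package, e.g. to build a per-package translator page."""
    return sorted({key for (pkg, _locale, key) in MESSAGES if pkg == package})
```

Keying on the package is what makes `keys_for_package` cheap: a translator page only has to show the keys of one package instead of the whole system.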

Only user-visible pages are being internationalized.  Admin pages specific to GP as well as standard ACS admin pages aren't internationalized.

Lars Pind added a "trn" tag to the template system and implemented a simple "#msg_key#" replacement syntax similar to the "@tcl_var@" construct.  This means that rather than tediously call "_" ( "[_ language_id msg_key]") and reference the result via @...@, you can just reference the key directly.  I modified his first "trn" implementation to allow variable substitution, so you can use it (for instance) with multirow queries that pull message keys from the db.
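To illustrate the "#msg_key#" idea, here is a small sketch in Python (not the template system's actual Tcl implementation). The key syntax accepted by the pattern and the `substitute_keys` name are assumptions made for the example.

```python
import re

# Assumption: keys consist of letters, digits, underscores and dots.
KEY_PATTERN = re.compile(r"#([A-Za-z0-9_.]+)#")


def substitute_keys(template, messages):
    """Replace each #msg_key# with its message; leave unknown keys untouched."""
    def replace(match):
        key = match.group(1)
        return messages.get(key, match.group(0))
    return KEY_PATTERN.sub(replace, template)
```

For example, `substitute_keys("<h1>#news.title#</h1>", {"news.title": "Nachrichten"})` yields `<h1>Nachrichten</h1>`, with no explicit lookup call in the page script.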

I've done a bunch of work to track current language and nationality via smart URLs and (if the user has them enabled) cookies, but this is more specific to GP's needs than the previous two items.

It all works quite well.  acs-lang caches a message-key/message pair, so it's fast.  The problem, of course, is that there's one more layer in the mix.  Tcl->Adp layout->message contents.  What was aD's solution for ACS Java, Yon?

When all is said and done, Bruno's anxious to get his modified acs-lang back in the mix.  It will take some work because to some degree it is specific to GP, and of course we would want some public discussion to see just how folks might want this to work in a more generalized context.  My guess is that it will be late summer before there's time to do this, but I can't really speak for Bruno.  Maybe he has Secret Plans to add more hours to the day and do it earlier :)

Lars's changes can go in at any time.  He already started a thread on a new form of the "include" tag, and was going to add it to the tree.    If this is a good time to discuss his "trn" and "#msg_key#" template extensions, let's do it.  They can go into the development branch of the tree at any time.

In the GP case, then, pages are designed as Tcl->Adp layout->message entities.  Standard OpenACS package pages aren't constructed this way.  Should they be?  Should the standard package Adp templates be simple Tcl->Adp entities as they are today, with alternative Adp templates made available as replacements?

6: Response to Globalization (response to 1)
Posted by Don Baccus on
Of course, globalization isn't just language. As Jon points out, acs-reference contains data necessary for other globalization issues.

Timezone, for instance.  The current acs-datetime package has hooks for timezones but makes no use of them.  Whatever is done on the language front, OpenACS 4 should become timezone-aware.  Given that the core stuff's centralized in acs-datetime (though clients like Calendar need to know how to display them and allow for selection, too) it shouldn't be too difficult.

That would be a nice, isolated, reasonably-sized task for someone.

Of course OF's been making changes to Calendar for dotLRN and has mumbled about it and acs-datetime having an overly complex datamodel and calendar itself being rather poorly implemented.

Perhaps Yon can summarize the changes OF has made?  Did OF tackle incorporating timezone stuff?

Then there's currency ... the form builder has some minimal support for currency.  You can build anything that can be represented with a leading symbol (blank allowed), whole part, separator, fractional part, and trailing symbol (blank allowed).  There are tools to cast from sql_number to the form builder's list format.  The currency form builder datatype was broken in ACS 4.2 but I've fixed it in OpenACS 4.5, though it's still very minimal in capability.
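The five-part representation can be sketched as follows, in Python for illustration. The function name, the list layout, and the rounding rule are assumptions; this is not the form builder's actual code.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_currency_parts(amount, leading="$", separator=".", trailing=""):
    """Split an amount into [leading, whole, separator, fraction, trailing]."""
    # Assumption: two fractional digits, rounded half up.
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    whole, _, fraction = f"{value:f}".partition(".")
    return [leading, whole, separator, fraction, trailing]
```

Joining the parts gives the display form: `"".join(to_currency_parts("1234.5", leading="", separator=",", trailing=" EUR"))` produces `1234,50 EUR`. Grouping of thousands is not covered, which matches the "very minimal" capability described above.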

So ... we do have some of the building blocks for non-language globalization issues, too ...

7: Response to Globalization (response to 1)
Posted by Tom Jackson on

You also need a useful discussion on configuration for multiple languages. There have been several threads concerned with this. Questions are:

  • How to get your database to handle multiple languages?
  • How to configure AOLserver to handle multiple languages?
  • How to set up a testing environment to find bugs?

I have started to work on these issues because I have a site that uses Italian, and another that wishes to use Farsi, running from the same AOLserver instance.

In my case using the UTF-8 encoding of Unicode seems the only possible solution. I didn't have any trouble configuring PostgreSQL to handle Unicode. I followed the advice on other threads for the configuration. Also, AOLserver was easily configured, I think. The main issue so far is setting up a testing environment to actually verify that these two big chunks of software are actually working.

To help, I have started to collect a few test pages. These are currently grouped at http://zmbh.com/utf-8/.

Here is a description of a few of these:

  • http://zmbh.com/utf-8/UTF-8-test.txt tests a bunch of stuff related to the UTF-8 parser. You can use this to test if your display's parser has any bugs. So far I haven't found a web browser that doesn't have a few, but this doesn't necessarily mean that you cannot view correct UTF-8 characters, only that incorrect characters are not handled correctly. This could lead to security problems. The only display that I have working is an xterm in -u8 mode. You probably need something similar to RedHat 7.0+, then you need to install 10646-1 fonts and start xterm with a command similar to:
    LC_CTYPE=en_US.UTF-8 xterm -u8 \
     -fn '-Misc-Fixed-Medium-R-Normal--15-140-74-75-C-90-ISO10646-1'
    
    The UTF-8 test file has the feature that if your parser is working, each line of the file is 79 chars plus a newline. The 79th char is '|', so you get a nice line down the right side that should line up.
  • http://zmbh.com/utf-8/utf8.html has many languages on it and can be used to check if the font you are using supports the language you want. Not every font supports characters from every language. A correct parser will 'replace' characters it does not have a glyph for with its replacement character. Sometimes this is a question mark that looks a little weird, or an upside-down question mark, or in the case of xterm, a dotted outline of the boundary of the character. It is not a bug in your parser to show these replacement characters; it is a bug if other strange, obviously non-language characters show up. That means your parser did not correctly find the multibyte character.

    It might still be possible that something along the line mangled the file. What seems to work for me is to download the file with wget and cat the file with the correctly configured xterm.

  • http://zmbh.com/utf-8/fconfigure.tcl opens the utf8.html file and configures the channel and reads the data into a string. It then uses ns_return to return the string. I used wget for this file as well and then used diff to figure out why the lengths were different. Everything was the same except the Vietnamese line. Maybe there is a bug in ns_return?
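The "79 characters plus a '|' in the last column" property of the UTF-8 test file lends itself to an automated check. The sketch below is in Python and makes simplifying assumptions: it counts code points rather than display columns, and it relies on Python's own policy for replacing malformed sequences, which may differ from what a given terminal or browser does. The function name is hypothetical.

```python
def check_utf8_lines(data, width=79, marker="|"):
    """Return the 1-based numbers of lines that fail the width/marker check."""
    # errors="replace" turns malformed input into U+FFFD instead of raising.
    text = data.decode("utf-8", errors="replace")
    bad = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # skip blank lines, including the one after a trailing newline
        if len(line) != width or line[-1] != marker:
            bad.append(number)
    return bad
```

Running it over the bytes fetched with wget gives a list of suspicious line numbers to inspect by eye, instead of scanning the right-hand column manually.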
8: Response to Globalization (response to 1)
Posted by Walter McGinnis on
Just an observation.  Shouldn't this thread be in the Design forum?

It's certainly not worth moving or starting over, though.

9: Response to Globalization (response to 1)
Posted by Reuven Lerner on

Tom mentions UTF-8 in an OpenACS site. I've been running a site with OpenACS 3.x in UTF-8 for the last 6-8 months (yad2yad.huji.ac.il); one of the requirements was that it work in Hebrew, Arabic, and English.

Getting PostgreSQL to work in Unicode wasn't hard at all; just pass the --encoding flag when you create the database. And once I put in the HackContentType flags described in someone's posting, the entire site worked just fine without modification. We're using the news, bboard, and chat modules in UTF-8, and everyone is pleased and impressed.

The few problems that I had were:

  • Getting ns_sendmail to work correctly with UTF-8. I ended up modifying modules/tcl/sendmail.tcl to encode e-mail in UTF-8.
  • Making sure that HTML forms would work for input in UTF-8. This normally happens if the encoding is set correctly, but testing this and double-checking that every page had the right content-type was tough.
  • Here and there, people have been having weird problems with encoding that we can't easily duplicate. The data looks close enough to Windows-1255 (i.e., Hebrew and English) that I suspect user error, but that's a pretty lame excuse when your users are elementary school students.

I haven't yet had a chance to look into i18n and Unicode issues in OpenACS 4.x; that's one of next week's challenges. Now that I think about it, I wouldn't mind seeing a global parameter that sets the encoding for pages and for outgoing e-mail.

And of course, none of what I've written here is true i18n; it's just a matter of ensuring that the right text can potentially appear on the screen.

10: Response to Globalization (response to 1)
Posted by Tilmann Singer on
There are lots of places where formats are hardcoded in to_char() calls, as initially discussed in this thread. I grepped through the toolkit for such format strings to find the most commonly used ones. Below is a list of the results that occur more than 5 times. The only thing that can be said by looking at the results is that currently the toolkit uses a wild mixture of format strings ...

It would be a step towards internationalization if we had some centrally defined pl/sql functions that do that formatting instead of using to_char(), so I'd like to suggest introducing the following ones:

acs_dt__time (just the time, e.g. "6:00 PM")
acs_dt__date (just a date, e.g. "12/31/2002")
acs_dt__date_long (date in longer format, like "December 31, 2002")
acs_dt__datetime - both date and time
acs_dt__datetime_long - both date and time in a longer format

Is that sufficient?

I think acs-datetime is the right place to add this stuff, but then this package would have to be made part of the always required packages.

Also, what would be the best approach to allow for changing locales per request? So that it can be set up in a way that on the same server acs_dt__date(some_date) produces "31.12.2002" for a German user and "12/31/2002" for an English-speaking user?

The only solution I can currently think of is to add a second parameter that passes the request's locale down to the pl/sql level. That would make a .tcl page with a query look like this:

set locale [ad_some_yet_to_write_locale_function]
db_1row getdate {
        select acs_dt__date(some_date, :locale) from bla
}

Looks a bit tedious to me. Any better solutions?
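For comparison, the per-locale lookup itself is small once the patterns live in one central place. This is a sketch in Python with a hypothetical `DATE_FORMATS` table and `format_date` name; it only illustrates the idea of passing the locale as a second parameter, not a proposed OpenACS API.

```python
from datetime import date

# Hypothetical per-locale patterns; a real system would define these centrally.
DATE_FORMATS = {
    "en_US": "%m/%d/%Y",
    "de_DE": "%d.%m.%Y",
}


def format_date(value, locale, default_locale="en_US"):
    """Format a date with the pattern of the given locale, or the default one."""
    pattern = DATE_FORMATS.get(locale, DATE_FORMATS[default_locale])
    return value.strftime(pattern)
```

With this table, `format_date(date(2002, 12, 31), "de_DE")` gives `31.12.2002` and the `en_US` locale gives `12/31/2002`, the two outputs asked for above.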

List of grep results follows.

{YYYY-MM-DD HH24:MI:SS} 86
yyyy-mm-dd 55
YYYY 46
HH24:MI 42
{Month DD, YYYY} 38
{Mon. DD, YYYY} 35
{MM/DD/YYYY HH24:MI} 19
HH:MIpm 18
{MM/DD/YY HH:MI AM} 18
{Mon DD, YYYY} 16
YYYY-MM-DD 15
MM/DD/YYYY 13
HH24 13
J 12
Q 12
{YYYY MM DD HH24 MI SS} 11
MM-DD-YYYY 11
YYYYMM 11
{Mon DD} 9
{HH24:MI:SS Mon DD, YYYY} 7
{MM-DD-YYYY HH12:MI:AM} 6
{HH24:MI, Mon DD, YYYY} 6
mm/dd/yyyy 6
{Mon fmDDfm, YYYY HH24:MI:SS} 6
{MM/DD HH24} 6
{Month DD, YYYY HH:Mi am} 6
hh24:mi:ss 6
{yyyy-mm-dd hh24:mi:ss} 6
{Month DD, YYYY HH:MI:SS} 6
{Day Month DD, YYYY} 6
YYYYMMDD 6
MMDDYY 6
{Mon DD, YYYY HH:MI:SS PM} 6
9999.9 6
{YYYY-MM-DD HH24:MI} 6
fmMM/DDfm/YYYY 6
{MM/DD/YY hh12:Mi am} 6
11: Response to Globalization (response to 1)
Posted by Lars Pind on
Tilmann,

Thanks for an interesting piece of research. Clearly there's a need to standardize.

Would it make sense to do the pretty-formatting in the web server layer? That way your pretty-printer-proc will automatically have access to the user locale. And it's possibly easier to do more advanced stuff, such as being able to say "Today at 10:32" or "Yesterday", when that's when it was.
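A rough sketch of such a pretty-printer, in Python for illustration. The function name and the exact output strings are assumptions, and the messages would themselves need translating in a real system.

```python
from datetime import datetime


def pretty_date(when, now):
    """Return 'Today at HH:MM', 'Yesterday', or an ISO date for anything else."""
    days = (now.date() - when.date()).days
    if days == 0:
        return "Today at " + when.strftime("%H:%M")
    if days == 1:
        return "Yesterday"
    return when.strftime("%Y-%m-%d")
```

Because `now` is passed in, the proc can be tested deterministically, and the caller decides which clock and timezone count as "now" for the current user.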

12: Response to Globalization (response to 1)
Posted by Andrew Grumet on
For the Development Gateway we use an approach similar to the GP system described above. An ADP page may contain bits of translatable text enclosed by TRN tags (our homegrown variety). A translator who has "translation mode" toggled on sees an edit link next to every translatable item. For simplicity the system uses 100% UTF-8, making no attempt to negotiate character sets with the User-Agent. Pages which use this system are linked from here: http://www.developmentgateway.org/node/118859/. (Note: though the top page uses images, and links to URLs that all contain "en", if you drill down you'll see that character data in the various languages is returned.)

The DG provides one interesting feature that I don't think was mentioned above. We distinguish between publisher-added "navigational" text and user-added content (a couple of our custom packages support language tagging of content). This helps, for example, a user whose first language is Spanish but who can read English and French as well. We allow such users to specify a single navigation language but multiple content languages.
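The split between one navigation language and several content languages could be modelled roughly like this. It is a Python sketch with hypothetical names (`LanguagePrefs`, `visible_items`) and a made-up item structure, not the Development Gateway's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class LanguagePrefs:
    navigation: str                               # single language for publisher-added text
    content: list = field(default_factory=list)   # languages the user can read


def visible_items(items, prefs):
    """Keep content items tagged with one of the user's content languages."""
    return [item for item in items if item["lang"] in prefs.content]
```

A Spanish speaker who also reads English and French would have `navigation="es"` and `content=["es", "en", "fr"]`, so German-tagged items are filtered out while the navigation stays in Spanish.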

In a few cases, we bypass the TRN tags and use language-specific ADPs, switching the target of ad_return_template based on the user's setting. This is useful for pages containing HTML forms that we don't expect to change much.

Good things about this system:

  • A lot of functionality for not too much work
  • Translators can view items in context before translating them
  • No tcl-level programming intervention required to make items translatable or to add new translatable items
Not-so-good things about this system:
  • Internationalizing an existing ADP page can require a lot of tedious hand-editing to add the needed TRN tags. I've tooled around with some code that uses regexps to break up a markup page along "<" and ">" boundaries but haven't gone very far with it. I'd be happy to share the code and ideas with anyone who is interested.
  • Our translation keys are global. Though we have some informal naming conventions, we have something of the system-wide sea of keys Don mentions above.
  • We don't have an easy way to track all of the places that a key is used. This is helpful for ensuring consistency, i.e. that a translation is correct for all of the places where a key shows up (or helping us determine that new keys are needed).
  • We haven't made any attempt at solving date- or currency-formatting issues.
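The regexp idea mentioned in the first item above, breaking a markup page along "<" and ">" boundaries, can be sketched like this in Python. It is not the code referred to in the post, and it deliberately ignores complications such as comments, scripts, template variables, or a ">" inside an attribute value.

```python
import re

# Capturing group so that re.split keeps the tags in the result.
TAG = re.compile(r"(<[^>]*>)")


def split_markup(markup):
    """Split markup into alternating text and tag pieces, dropping empty strings."""
    return [piece for piece in TAG.split(markup) if piece]


def translatable_text(markup):
    """Return the non-blank text runs, i.e. the candidates for TRN tagging."""
    return [p.strip() for p in split_markup(markup)
            if not p.startswith("<") and p.strip()]
```

For `<h1>News</h1><p>Read <b>more</b></p>` this yields the candidates `News`, `Read` and `more`. It also shows a limit of the approach: a sentence broken up by inline tags ends up as several separate keys.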
Not-so-good things that seem harder and more subtle:
  • As Henry Minsky has pointed out elsewhere, Unicode/UTF-8 is not smart about fonts. Henry can describe it better than I, but the problem is that while two (or more) languages share certain characters, their "pretty" representation may depend on the language. So ultimately the right solution is probably to do the locale/charset negotiation and rely on the browser to pick a good font.
  • Our system doesn't know anything about language constructs. I.e. it doesn't know that "la voiture" and "voiture" are related.
As a final, mildly off-topic point, one of the cooler things to appear in recent months is this idea of a distributed translation database (see http://www.newscientist.com/news/news.jsp?id=ns99992115). This opens up the possibility of building something of long-lasting value with a much wider community, not to mention making translation help available through XML-RPC and SOAP.
13: Response to Globalization (response to 1)
Posted by Don Baccus on
"The DG provides one interesting feature that I don't think was mentioned above. We distinguish between publisher-added "navigational" text and user-added content (a couple of our custom packages support language tagging of content)."

Internally the GP system does the same. Currently content is exposed to the user based on their language choice, but it would be possible to embed French content pages in a page with English navtools, images, eye-candy etc.

"In a few cases, we bypass the TRN tags and use language-specific ADPs, switching the target of ad_return_template based on the user's setting. This is useful for pages containing HTML forms that we don't expect to change much."

We're not doing this at GP. However, we are giving each National Greenpeace organization (which are separate legal and financial entities from the International umbrella org) the ability to customize each and every template in the system. We wrote a slightly modified version of ad_return_template that first looks for a customized template for the national subsite we're visiting. If it doesn't find one, the default template is returned. The web editors just upload the custom template to the right place and it's found automatically. It's a bit fragile but it fits their needs and budget.
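The lookup order described above (custom template for the subsite first, default otherwise) can be sketched as follows, in Python for illustration. The directory layout, the `resolve_template` name, and the injectable `exists` check are assumptions, not the modified ad_return_template itself.

```python
import os
import posixpath


def resolve_template(template, subsite, root="templates", exists=os.path.exists):
    """Return the subsite's custom template if present, else the default one."""
    # Hypothetical layout: <root>/custom/<subsite>/<template> overrides <root>/default/<template>.
    custom = posixpath.join(root, "custom", subsite, template)
    if exists(custom):
        return custom
    return posixpath.join(root, "default", template)
```

Passing `exists` as a parameter keeps the sketch testable without touching the filesystem. In production the default `os.path.exists` would pick up whatever the web editors have uploaded.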

This is tied to a GP-specific "homepage" package which is mounted for each of the national organizations that are participating. So it's not useful in the general context of OpenACS 4. However it's given me ideas for implementing similar functionality based on acs-subsite mount-points. Currently it's easy to change the look-and-feel of a subsite via the master template, which lets you specify a CSS stylesheet, of course. But beyond that we have no mechanism to customize one subsite's look and feel, and this gives us one.

It's a bit off-topic but in Greenpeace's case the need is driven by the nature of the organization - one umbrella international and individual national orgs each with their own opinion of what their website should look like. This is probably not unusual in the world of international NGOs so it vaguely relates to "globalization".

14: Response to Globalization (response to 1)
Posted by Tilmann Singer on
Lars,

Probably moving the formatting to the web server level is the only way. I think the inefficiency argument Jonathan brings up against formatting the timestamps in the web server can be countered by selecting the timestamp in epoch format from the database (select date_part('epoch', some_date) as some_date ...), because this can be fed directly to the Tcl clock command without extra parsing overhead.

The only further drawback of this that I can think of is that one would have to do some extra typing in the tcl pages, but maybe this approach is the one with the least effort. For example a multirow would look like this:

db_multirow dates dates {
  select date_part('epoch', some_date) as some_date from foo ...
} {
  set some_date_pretty [dt_short $some_date]
}

Some typing could be saved by defining a dt_epoch(timestamp) plsql function.

And yes, it would be a lot easier to do fancy stuff like return "Yesterday", or "2 hours ago". The procs could have a -fancy_p switch for that.
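The epoch-based approach can be sketched outside the toolkit as follows, in Python rather than Tcl. The `dt_short` name is borrowed from the example above; the UTC-offset parameter and the default pattern are assumptions added to show where timezone handling could hook in.

```python
from datetime import datetime, timedelta, timezone


def dt_short(epoch_seconds, utc_offset_hours=0, pattern="%Y-%m-%d %H:%M"):
    """Format epoch seconds (as selected from the database) for display."""
    # Assumption: the user's timezone is given as a fixed offset from UTC.
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz=tz).strftime(pattern)
```

Since the database only hands over a number, no string parsing is needed in the web server, and the same value can be rendered differently per user without changing the query.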

15: Response to Globalization (response to 1)
Posted by Titi Ala'ilima on
We just finished doing some work with Henry Minsky's acs-lang, with a couple of extra hacks to the request processor and a few hacks to acs-templating, so that we can now specify locale-specific .adp and/or .tcl pages.  Locale is specified in the db if they're logged in or in a cookie otherwise.
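The locale selection rule (database for logged-in users, cookie otherwise) might look roughly like this. It is a Python sketch: the function name, the cookie name "locale", and the default value are assumptions, and the real hacks live in the request processor.

```python
def resolve_locale(user_id, stored_locales, cookies, default="en_US"):
    """Pick the request locale: stored preference if logged in, else the cookie."""
    if user_id is not None:
        # Logged-in user: use the stored preference, or the default if none is set.
        return stored_locales.get(user_id, default)
    # Anonymous visitor: rely on the cookie, if any.
    return cookies.get("locale", default)
```

The resolved locale can then drive which locale-specific .adp or .tcl page is served.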
16: Response to Globalization (response to 1)
Posted by Malte Sussdorff on
Though it might be a little bit late, we have a site running using international characters throughout the world as well as using multiple languages in a test phase.

If you want to have a look at how it was designed, go to http://www.sussdorff-roy.com/resources/internationalization.

Thought I already mentioned this, but well, one never knows.
17: Response to Globalization (response to 1)
Posted by Andrei Popov on
To revive this thread a little.  Suppose I want to be able to use the same database instance, but serve templated pages in different languages.  Say, having the ticket-tracker page return the *same* ticket list, but have headings, etc. be presented in different languages, depending on user preferences.  Any idea how this can be done?

Tnx