Forum OpenACS Development: What issues remain for OpenACS on AOLserver4

The AOLserver Core Team would like to know what outstanding issues exist for using OpenACS on AOLserver4.

  • I have one -- filter bug: when a filter has an error, the server just closes the connection to the client. The error is logged, but a 500 response should be returned. Some browsers, like Mozilla and Opera, try multiple times to grab the page when this happens, sometimes masking the problem. A fix is being worked on.

    Any others?

  • Posted by tammy m on
    This nsv bug is still open. It affected me like this. I don't know what other effects it may have on OACS packages as I don't use a lot of them (I'm just getting started with OACS).

    Also, it looks like the Site Map returning "no data" or "broken url" is AOLserver4 specific (I haven't run earlier AOLserver versions with OACS, but others have commented that it worked).

    Posted by Tom Jackson on

    Thanks Tammy, it looks like the filter error bug has been fixed. Vinod also seems to have figured out that the (incorrect) use of ns_eval was causing the problem with the site node map. I don't know what his solution will be, but he will probably substitute an nsv array for the bad code.

    I think the nsv bug is still open; this should be a high priority to fix.

    It is very helpful to report these bugs. OpenACS has a huge code base and it serves as a great test of AOLserver code.

    Posted by Peter Alberer on
    I had a problem with ns_adp_parse. Did no one else see that problem?

    <blockquote>When trying to access the webstats area or some static
    html pages everything works ok (pages are delivered),
    but when I try to get a page that is using the openacs
    templating system there is no answer. I could trace the
    problem to the function "ns_adp_parse". That function
    does not return and does not deliver an error either. It
    seems to break when compiling tags defined by openacs.
    </blockquote>

    Posted by Don Baccus on
    I haven't seen this last problem, no, but I'm not sure if I've tried any pages with our own tags defined.  I've been doing all my OpenACS 4.6 and dotLRN 1.0 development on AOLserver 4.0 beta2, though - and so has Jeff, I believe.

    And I haven't seen the site map problem, though perhaps I didn't understand the bug report and just haven't visited it in a way that triggers the problem.

    The biggest issue has nothing to do with OpenACS - the fact that code that's not threadsafe has snuck into the implementation of the "file" command, leading to seemingly random crashes or (even worse) memory corruption that can cause files to be written to seemingly random directories etc etc.

    What's the status of this problem?

    It's a Tcl 8.4.x problem, not an AOLserver problem, but since AOLserver 4.0 requires Tcl 8.4.x it means one can't use AOLserver 4.0 in a production system.

    Posted by Tom Jackson on

    Don, thanks for posting this one. It is like we have our hands tied, since it isn't an AOLserver bug.

    The bug is due to passing a Tcl_Obj between threads (in the cd command), and it hasn't been fixed. Apparently the person responsible for this is on vacation or busy. I'm amazed that such a problem would occur and it wouldn't be tagged as a critical bug. On some platforms (BSD derived), using pwd calls cd, so the bug is much worse. I wasn't aware that the file command had a problem too.

    Also, the nsv bug has been fixed.

    Posted by Don Baccus on
    If I understand Zoran's posts correctly, a variety of file commands call cd as part of their implementation.

    You're right, it is disappointing that the Tcl team hasn't treated this as a crisis-level bug needing an immediate fix.  I guess not that many people use Tcl threaded outside the AOLserver community ... on the other hand AOLserver users represent a significant slice of the Tcl community.

    Posted by Lars Pind on
    If the Tcl community is not responding, is there a chance that we could temporarily fix this ourselves, and include a patched version of Tcl in AOLserver until whoever gets back from vacation? Or is the problem non-trivial to fix?

    Has the Tcl core team or Ousterhout himself been contacted about this (perhaps with a patch)?

    /Lars (who hasn't looked into AOLserver 4 at all but feels like he should)

    Posted by Tom Jackson on

    Lars,

    I'll post your questions to the AOLserverCore list. I need an update, and a chance to hopefully nudge things along.

    From what I understand right now Zoran Vasiljevic (a Core Team member) is developing on Darwin. On this platform the cd command is used behind the scenes in a number of commands, so the problem shows up very quickly. All BSD variants seem to have the same bug. So he discovered the bug and traced it to the passing of Tcl objects (pointers?) between threads. The fix apparently requires changing this to something threadsafe.

    Posted by Andrew Piskorski on
    Lars, my understanding is that fixing the Tcl thread-safety bug in question is quite non-trivial.

    In his original 2003-03-26 post to the AOLserver list reporting the problem, Zoran said:

    The problem is in Tcl generic/tclIOUtil.c and naive handling of static Tcl_Obj *cwdPathPtr. The pointer to this Tcl object gets shuffled around threads by simple reference, it is read (referenced) without proper locks, etc. The implementor obviously protected the most obvious write operations, but neglected any others. Also, the Rule#1 in Tcl "Do not pass Tcl_Obj's between threads" is grossly violated.

    And here's a different, more round-about take on why it's probably a difficult problem:

    Back in March when this came up on the AOLserver list, I didn't understand that the current working directory of a process is maintained by the kernel, not by the process itself. So I was speculating about maybe being able to fix things by simply giving every thread its own independent thread local storage (aka, thread specific data) CWD. Here's what Rob Mayoff had to say about that:

    Perhaps you do not realize that a process's current working directory is tracked by the kernel, not by the process. Tcl keeps track of its CWD for speed, but ultimately it's the kernel, not the process, that resolves relative pathnames, so it's the kernel's idea of the CWD that matters.

    I believe that POSIX requires that all threads in a process share a working directory. Making each thread appear to have its own working directory requires either non-standard kernel support for per-thread CWD (which Linux has, but I don't think you can get to it through the pthreads interface), or intercepting every system call that involves a pathname (open, link, symlink, unlink, rename, access, stat, lstat, chdir, chroot, chmod, chown, lchown, mknod, mkdir, rmdir, bind, connect, and probably some more that I've forgotten). You might be able to ignore some of these for AOLserver, but intercepting any of them isn't necessarily easy, and it's definitely not possible to do so portably.

    It still might be the best way to fix this problem, though.

    Note Rob's last line - scary! Zoran independently said much the same thing:
    Eh, the cwd is the thing which is used by most path-related sys/lib calls to resolve the absolute path of the file. It is tracked in the kernel, not in the process, so in order to make this happen, you ought to intercept *all* of the sys/lib calls fiddling with paths. Now, Tcl with its virtual filesystem *might* achieve this, since it really isolates the upper layers from the OS-specifics. But, if you ask me, I think this is voodoo.

    To be honest, I was also playing with this idea, but after giving it a serious thought, I've abandoned it.

    Anyway, Zoran was working on fixing the bug, and last we heard he had some sort of fix (maybe only partial, I'm not sure) as of March 27, but it wasn't in the Tcl core yet. I haven't heard anything since then.
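
    To make Rob's and Zoran's point concrete, here is a minimal sketch (in Python rather than Tcl or C, purely for illustration) of the working directory being a per-process property: a chdir made in one thread is seen by every other thread. The function name `chdir_in_worker` is hypothetical, and this only demonstrates the shared-CWD behaviour, not the Tcl bug itself.

```python
import os
import tempfile
import threading


def chdir_in_worker(path):
    # Runs in a separate thread; os.chdir changes the CWD of the whole process.
    os.chdir(path)


original = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.realpath(tmp)
    worker = threading.Thread(target=chdir_in_worker, args=(target,))
    worker.start()
    worker.join()
    # The main thread never called chdir, yet its CWD has changed.
    seen_by_main = os.path.realpath(os.getcwd())
    os.chdir(original)  # restore before the temporary directory is removed

print(seen_by_main == target)
```

    This is why a per-thread cached CWD cannot simply replace the shared one: relative paths are still resolved against the single directory the kernel tracks for the process.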

    Oh yeah, and totally off-topic: This business of CWD always being tracked by the kernel, etc., is making me think that the exokernel guys really do have the right idea, and that safely multitasking the hardware and providing nice system call abstractions should be independent features of the OS environment, not both mushed together into the one system-wide kernel.

    Posted by Tom Jackson on

    Zoran says he has a fix for the cd bug. Here is his reply to Lars' questions:

    There is, albeit I think it will be better to push 8.4.3 out, ASAP.
    
    
    > Has the Tcl core team or Ousterhout himself been contacted about this
    > (perhaps with a patch)?
    >
    
    Yes. Patch is posted to SF and the person involved is aware about it.
    
    Cheers,
    Zoran
    
    Posted by Andrew Piskorski on
    That would be SourceForge bug 710642. Looks like 711232 is also related.
    Posted by Jim Lynch on
    Hi, tcl-8.4.3 has been released today.

    You can read the announcement, release notes and changelog too.

    This release fixes a multithreading issue, which boiled down to tcl's "cd" command not being thread-safe. While most users didn't notice (because they didn't use cd in threads), aolserver cds a lot, so we noticed :)

    Also, an issue that occurred in tcl-8.4.0 and was fixed in tcl-8.4.2 involved an erroneous status code from "catch".

    (details: the problem was that the return value of catch was TCL_OK (0) when it should have been TCL_RETURN (2) when encountering a return statement.)
    Thanks to Mark Dalrymple for tracing the problem he saw to catch; I then asked the tcl people if that was a known issue, and it was. This issue affected openacs because it checks some invocations of catch for return value of 2 in db_exec and friends. A suggestion was made that the sense of the test be reversed, and the code should check for TCL_ERROR (1) instead. This way, the important condition (error occurred) is tested for, not whether or not a return statement was encountered in the catch block. Thanks to RockShox on freenode irc's #tcl channel for that suggestion.
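
    The reasoning behind reversing the test can be sketched with a small model. This is Python standing in for Tcl, not the actual OpenACS code: `catch`, `TclReturn`, `completed_old` and `completed_new` are hypothetical names, and the `buggy` flag only imitates the 8.4.0 behaviour described above.

```python
TCL_OK, TCL_ERROR, TCL_RETURN = 0, 1, 2  # Tcl's standard completion codes


class TclReturn(Exception):
    """Stands in for a `return` executed inside the caught script."""


def catch(script, buggy=False):
    """Toy model of Tcl's catch: run the script, report a completion code."""
    try:
        script()
    except TclReturn:
        # buggy=True imitates the reported bug: TCL_OK instead of TCL_RETURN.
        return TCL_OK if buggy else TCL_RETURN
    except Exception:
        return TCL_ERROR
    return TCL_OK


def body_with_return():
    raise TclReturn()


def body_with_error():
    raise ValueError("database error")


def completed_old(status):
    # Assumed shape of the fragile test: success means "saw a return".
    return status == TCL_RETURN


def completed_new(status):
    # Suggested test: success means "no error was raised".
    return status != TCL_ERROR


status = catch(body_with_return, buggy=True)
print(completed_old(status), completed_new(status))
```

    With the buggy catch, the old-style test misreads a body that returned normally, while the error-based test gives the same answer whether or not the bug is present.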
    Posted by Mohan Pakkurti on

    OpenACS + aolserver 4.0 on Solaris - I have been having this problem, and am puzzled as to what could be going on. Is there anyone who has managed to get OpenACS + aolserver 4 working on Solaris?

    /Mohan

    Posted by Joel Aufrecht on
    Can we use AOLserver 4 with OpenACS in production _if_ we also use Tcl 8.4.3?  Is this our platform for OpenACS 5.0.0?  Should I update the HEAD documentation?  Is this the end of the patched AOLserver 3 distribution?
    Posted by Talli Somekh on
    Tcl 8.4.4 was released in late July, at least according to the website.

    talli

    Posted by Tom Jackson on

    Should this wait until AOLserver4 is actually out of beta? One nice thing is that it is relatively easy to replace AOLserver in an OpenACS installation, but one issue still open is SSL. If this function is moved to a proxy server it might be a go at this point.

    Posted by C. R. Oldham on
    <blockquote> [SSL]...If this function is moved to a proxy server
    it might be a go at this point.
    </blockquote>

    If SSL is moved to a proxy server then you must serve all requests via SSL, or use a "smart" proxy server that knows what parts of your site need to be served via SSL.  You cannot communicate this information back into OpenACS without some hacking, so I would vote that we not recommend AOLserver 4.0 until there is a working SSL implementation.

    Posted by Bart Teeuwisse on
    I did some hacking to the security procs of OpenACS to achieve just that for Pound. Having tried Squid and failed to redirect HTTP connections to HTTPS connections in the proxy I've switched back to Pound. Squid is also incapable of informing AOLserver which connections to AOLserver are HTTPS connections to the Proxy.

    Pound has the issue with ns_write but there is a patch in the making to remove this limitation. In all other respects, I found Pound to be better. For example, Pound can add a custom header to requests forwarded to AOLserver when the request comes in as an HTTPS connection to Pound. Using this information, I have modified the security procs of OpenACS to treat these requests as if they were HTTPS connections to AOLserver.

    The big win is that security management becomes transparent to OpenACS. One can still use the same security methods in OpenACS as before.
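
    The idea can be sketched roughly as follows. This is Python for illustration only, not the actual change to the OpenACS security procs: the header name `x-ssl-request`, the value `"1"`, the proxy address list and the function name are all assumptions. Checking the peer address is an extra precaution added in this sketch, so that a client talking to the backend directly cannot forge the header.

```python
def is_secure_request(headers, peer_addr, direct_https=False,
                      proxy_addrs=("127.0.0.1",),
                      flag_header="x-ssl-request"):
    """Decide whether a request should be treated as HTTPS.

    headers      -- dict of request header names to values
    peer_addr    -- address the backend connection came from
    direct_https -- True if the backend itself terminated SSL
    proxy_addrs  -- addresses of trusted SSL-terminating proxies (assumed)
    flag_header  -- header the proxy adds for HTTPS requests (hypothetical)
    """
    if direct_https:
        return True
    # Only believe the flag header when the trusted proxy sent the request.
    if peer_addr not in proxy_addrs:
        return False
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get(flag_header.lower(), "").strip() == "1"


print(is_secure_request({"X-SSL-Request": "1"}, "127.0.0.1"))
```

    Keeping the decision in one function is what lets the rest of the toolkit stay unaware of whether SSL was handled by the server or by the proxy.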

    Also, nsopenssl should not be far off for AOLserver 4.0.

    All in all, AOLserver 4.0 can be used with OpenACS under certain circumstances:

    1) When the site doesn't require SSL
    2) When the site uses SSL but offloads the SSL handshake to Pound and user pages don't use ns_write
    3) When the site uses SSL but offloads the SSL handshake to Pound and Gustav's patch is applied to Pound.

    Options 2) and 3) also require my hack to OpenACS. Should I be committing this hack to CVS?
    /Bart

    Posted by Andrew Piskorski on
    Bart, your changes to make OpenACS support using Pound sound like they're very useful, and should go into the toolkit. Will you add them for OpenACS 5.0?

    Barry, the patch Bart is talking about is for Pound. One of the Pound maintainers posted to the AOLserver list with info about it. Basically, AOLserver happens to use older style syntax for some HTTP stuff, which Pound didn't support yet, so he's adding it.

    Posted by Barry Books on
    I had never seen Pound before. I've used ssltunnel in the same way but now I've switched to SSL hardware. I found the software performance almost unusable and SSL hardware is very cheap now on eBay. I've picked up 4 Intel boxes for about $200 apiece. They claim something like 600 connections per second. In software I might be able to do 6.

    I did run into some redirect problems but did not see anything with ns_write. Is the patch you have for Pound or AOLserver?