Forum OpenACS Q&A: nsd aborts when restarted after apparently successful openACS installation

Hi ppl,

i have just installed aolserver, created my postgresql database as nobody, done createlang plpgsql mydbname as superuser, etc, got nsd (aolserver) running, gone through the openACS installation process, registering the admin user details etc and all looked completely successful. The tables were all created in the database, and there were no errors in my postgresql logfile, etc. The aolserver was shut down automatically and i was informed everything would be operational when i restarted my aolserver.

Unfortunately, when i try to restart aolserver, with:

cd /usr/local/progs/aolserver
./bin/nsd -t /home/nobody/web/vorpal/vorpal.tcl -u nobody -g web

the server starts up, runs flat out for a while, then aborts completely.

Examining the error log, it says all the modules are loaded successfully, including nsxml etc

There is this warning:

[09/Mar/2003:01:54:31][9418.1024][-main-] Warning: apm_boostrap_load_file skipping /home/nobody/web/vorpal/packages/acs-tcl/tcl/tcl-documentation-tests.tcl because it isn't either a -procs.tcl or -init.tcl file

There are also a lot of the messages as shown below:

[09/Mar/2003:00:53:41][7212.4101][-conn1-] Notice: /home/nobody/web/vorpal/packages/wp-slim/wp-slim.info
NOTICE:  identifier "apm_package_version__version_name_greater" will be truncated to "apm_package_version__version_na"
NOTICE:  identifier "apm_package_version__sortable_version_name" will be truncated to "apm_package_version__sortable_v"

But besides that everything seems to go well, until

[09/Mar/2003:01:54:44][9418.1024][-main-] Debug: Loaded packages/acs-content-repository/tcl/acs-content-repository-init.tcl.
[09/Mar/2003:01:54:44][9418.1024][-main-] Debug: Loading packages/acs-mail/tcl/acs-mail-init.tcl...
[09/Mar/2003:01:54:44][9418.1024][-main-] Notice: Scheduling proc acs_mail_process_queue
[09/Mar/2003:01:54:44][9418.1024][-main-] Notice: acs-mail: ns_uuencode works!!
Aborted

The end of my postgresql log shows that everything was going great until nsd aborted...

2003-03-09 01:54:44 [9424]  DEBUG:  StartTransactionCommand
2003-03-09 01:54:44 [9424]  DEBUG:  query:
    select distinct package_key
    from apm_package_versions
    where enabled_p='t';
2003-03-09 01:54:44 [9424]  DEBUG:  CommitTransactionCommand
2003-03-09 01:54:46 [9426]  DEBUG:  pq_recvbuf: unexpected EOF on client connection
2003-03-09 01:54:46 [7073]  DEBUG:  reaping dead processes
2003-03-09 01:54:46 [7073]  DEBUG:  child process (pid 9426) exited with exit code 0
2003-03-09 01:54:46 [9425]  DEBUG:  pq_recvbuf: unexpected EOF on client connection
2003-03-09 01:54:46 [7073]  DEBUG:  reaping dead processes
2003-03-09 01:54:46 [7073]  DEBUG:  child process (pid 9425) exited with exit code 0
2003-03-09 01:54:46 [9424]  DEBUG:  pq_recvbuf: unexpected EOF on client connection
2003-03-09 01:54:46 [7073]  DEBUG:  reaping dead processes

I am running Mandrake 8.2, using Matt's AOLserver distribution:
(AOLserver/3.3.1+ad13 (aolserver3_3_1_ad13)
  CVS Tag:        $Name: nsd_v3_r3_p1 $

PostgreSQL is 7.2

openACS is 4.6

does anyone have any idea what's going on here?!

help!

John

Try setting/increasing the stacksize in the nsd config file. There are two occurrences of stacksize in the config file and I don't know which one is the right one, so I have both in mine:

ns_section "ns/parameters"
ns_param  StackSize      [expr 1024 * 1024]

ns_section "ns/threads"
ns_param  stacksize      [expr 1024 * 1024]

Could someone who knows it please post the recommended settings? (Maybe also as a comment on https://openacs.org/doc/openacs-4/aolserver.html) Thanks
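
For anyone unsure which stacksize entries their own config file actually contains, here is a minimal shell sketch. It is not part of AOLserver, and `list_stacksize` is a made-up helper name. It prints every stacksize parameter together with the ns_section it sits under; it only reports what is in the file and says nothing about which section the server honours.

```shell
# list_stacksize CONFIG
# Hypothetical helper: print each stacksize ns_param found in CONFIG,
# prefixed with the ns_section it belongs to. The parameter name is
# matched case-insensitively, so "StackSize" and "stacksize" both show up.
list_stacksize() {
    awk '
        $1 == "ns_section" { section = $2 }
        $1 == "ns_param" && tolower($2) == "stacksize" {
            print section ": " $0
        }
    ' "$1"
}
```

Called as `list_stacksize /home/nobody/web/vorpal/vorpal.tcl` (the config path from the first post), it should list one line per stacksize setting.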

thanks for the tip, but unfortunately still no luck :(

one thing though... even with the larger stacksize, it still fails at the exact same point...

so there is a problem here...

[snip several hundred lines of output at least...so the server is getting a fair bit of initialisation done before it dies]
[09/Mar/2003:23:07:34][30118.1024][-main-] Debug: Loading packages/acs-mail/tcl/acs-mail-init.tcl...
[09/Mar/2003:23:07:34][30118.1024][-main-] Notice: Scheduling proc acs_mail_process_queue
[09/Mar/2003:23:07:34][30118.1024][-main-] Notice: acs-mail: ns_uuencode works!!
Aborted

i notice that there is supposedly some patch for ns_uuencode to make it work with binary files... could it be a problem with that (in spite of the fact that the message above says ns_uuencode works)?

Or, perhaps it is failing on the step immediately after that one...

Could someone please start up their server with -f and -q options and tell me what the server does next, ie immediately after the lines that my server dies on ??

Maybe that will help work out what the problem is.

I'm not sure whether it is worth wasting time trying to fix this error or whether it would be better to use the aolserver4 CVS which i just got. It built fine, but I will need to get nspostgres or configure it to work with postgres.so which i will do tomorrow.

If the AOLserver4 CVS is not too hard to get working with openACS maybe that will be the way to go.

Thanks for any help. I'm really quite keen to have a look at the petri-net based openACS workflow package... with a working AOLserver!

John

Posted by John S on
it's just not worth going there yet ...

compiling the AOLserver4 CVS and adding nssha and nsxml modules successfully still gives:

./bin/nsd -f -t /home/nobody/web/vorpal/vorpal.tcl -u nobody -g web
./bin/nsd: relocation error: /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so: undefined symbol: TclCompileScript@tclByteCodeType

so it looks to me like some tcl related garbage is not being linked properly...

Shame on the AOLserver developers for requiring such a new version of tcl.

I am sick of this sort of rubbish. What is the point of making your software so bleeding edge that it doesn't even work... and people are forced to waste their valuable time compiling useless CVS filth because there is no portable, recent, stable release available ...

I wouldn't be so peeved if it wasnt for the fact that this AOLserver thing is supposed to be robust, powerful, industrial-strength etc etc.

bleagh.

What a waste of time

What is the point of this useless AOLserver piece of junk anyway? What can it *supposedly* do that Apache can't??

It can barely talk to a postgresql database by the looks of it, the ssl module requires BSAFE which appears to only be available with 30 day trial licenses, and it doesn't even work!

Currently, the damn thing accomplishes less than the following C program, which will compile and even run:

int main(void) {
    /* I am more useful than AOLserver4 CVS */
    return 0;
}

Hmm, so now you want the cvs version of beta software to work flawlessly? I seem to be running from cvs just fine, most of the time. Try complaining at the AOLserver list, subscribe at listserv.aol.com.

I built aolserver from the cvs head on Friday and did not encounter any problems. AOLserver4 does not require Tcl 8.4.2; it will work just fine with 8.4.1, which has been out since October last year (and probably with 8.4.0 as well, but I don't know that first hand). Tcl is also quite mature and stable, so calling it bleeding edge is pretty absurd.

As for what aolserver can do that apache can't: for one, you can run OpenACS. Since that's what most of us here are focused on, it's reasonably important to us.

I don't want this to sound insulting but a lot of people have built aolserver from the cvs head without trouble. I guess it would be interesting to figure out where you went wrong but if you are too frustrated to bother that's fine as well. It would be nice to know what platform you are building for and what you gave to configure when you built tcl.

Well, yeah ok i admit i probably got a bit carried away there. By the looks of it the problem is a tcl problem anyway. the relocation error looks like my tcl is not linking properly. So I should probably be peeved with tcl, not aolserver.

I got tcl 8.4.2 which is supposed to be a stable release, and i did build it with enable threads, etc. The tcl 8.4.2 did compile flawlessly and only failed one of its self-tests out of 8000 or so - it was some file handling test or other...

I also compiled tk8.4.2, even though aolserver does not require tk... and tk8.4.2 compiled fine but failed many of its tests... many worked, but there were a lot of failures.

As you say, tcl is supposed to be a mature and stable product as well, so i am disappointed with 8.4.2 so far... as the third release in the 8.4 series it should be ok, even if it was only released a few days ago. But i despise tcl anyway so I am not surprised. I will try 8.4.1 anyway.

I have lost the configure commands that I used to build tcl from my history, due to the demented way in which multiple gnome-terminals mangle one's history, by reading it in when they start, and not writing it until they exit.. ie if you start 2 terminals, do a lot of commands in one, then exit it, it will write your commands into the history file. But if you then exit the 2nd terminal, it will overwrite the history with its version which has none of the commands you just did in the other terminal.... Anyway, I recall that the only additional option i used was to enable threads.
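
If the shell running in those terminals is bash (an assumption; the post only names gnome-terminal), the overwriting described here is bash's default behaviour at exit rather than something the terminal itself does. A minimal sketch for ~/.bashrc that makes each session append to the history file instead:

```shell
# Sketch for ~/.bashrc, assuming the interactive shell is bash.
if [ -n "$BASH_VERSION" ]; then
    # Append this session's commands to the history file on exit
    # instead of overwriting the file.
    shopt -s histappend
    # Also flush new commands to the file every time a prompt is shown,
    # so they are already saved before any other terminal exits.
    PROMPT_COMMAND="history -a${PROMPT_COMMAND:+; $PROMPT_COMMAND}"
fi
```

With this in place, commands typed in one terminal should survive the exit of another, at the cost of the history file interleaving commands from several sessions.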

Anyway, the aolserver build actually seemed to go fine. I configured the build with:

./configure --prefix=/usr/local/progs/aolserver-cvs --with-tcl=/usr/local/tcl/tcl8.4.2/lib

I made /usr/local/progs/aolserver a link to aolserver-cvs.
Then i installed the nssha and nsxml modules, but no luck.

Regarding the AOLserver build, it was unclear to me whether it was still necessary to apply the patch for ns_uuencode to work with binary files and/or the patch to make aolserver work with the -g flag. The instructions on what to do with the patches from sourceforge are rather unclear... so i am hoping they are no longer needed with AOLserver4.

I'll see how i go with tcl8.4.1 and report back.

also, re the platform, it is Mandrake 8.2 with many additional bells and whistles.

re the relocation error i did find this on the net which was interesting...

----------------------------------------------------------
Most likely Mandrake's python was not using the RTLD_GLOBAL hack that Red Hat Linux had.  If libimlib-jpeg.so needs a symbol called "_gdk_malloc_image", it needs to have a DT_NEEDED entry that says what library to get it from.  (i.e., it needs to include -lgdk on the link line)

As you can see:

[msw@sid msw]$ ldd /usr/lib/libimlib-jpeg.so
        libjpeg.so.62 => /usr/lib/libjpeg.so.62 (0x4001a000)
        libc.so.6 => /lib/i686/libc.so.6 (0x42000000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x80000000)

This would be OK if libgdk_imlib.so linked against libgdk, but:

[msw@sid 8.0]$ ldd /usr/lib/libgdk_imlib.so
        libSM.so.6 => /usr/X11R6/lib/libSM.so.6 (0x4003c000)
        libICE.so.6 => /usr/X11R6/lib/libICE.so.6 (0x40046000)
        libXext.so.6 => /usr/X11R6/lib/libXext.so.6 (0x4005d000)
        libglib-1.2.so.0 => /usr/lib/libglib-1.2.so.0 (0x4006b000)
        libc.so.6 => /lib/i686/libc.so.6 (0x42000000)
        libX11.so.6 => /usr/X11R6/lib/libX11.so.6 (0x4008f000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x80000000)
        libdl.so.2 => /lib/libdl.so.2 (0x4016c000)

So it's all totally broken.  The RIGHT thing to do is fix gdkimlib.  I
think it would be much easier to make your application use gdkpixbuf
which doesn't suffer these coding errors.
Cheers,
Matt
-----------------------------------------------------

So Mandrake have been known to make a mess of their linking from time to time.

But the output of ldd on my nsd is

: jss: 22:23:49 /usr/local/src/gui/tcl/tcl8.4.2/unix ; ldd /usr/local/progs/aolserver/bin/nsd
    /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so => /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so (0x40016000)
    libnsd.so => /usr/local/progs/aolserver-cvs/lib/libnsd.so (0x400b8000)
    libnsthread.so => /usr/local/progs/aolserver-cvs/lib/libnsthread.so (0x40105000)
    libdl.so.2 => /lib/libdl.so.2 (0x4011f000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x40122000)
    libm.so.6 => /lib/libm.so.6 (0x40138000)
    libc.so.6 => /lib/libc.so.6 (0x4015a000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

So I cannot see how there can be any problem finding TclCompileScript@tclByteCodeType, as tcl8.4.2.so DOES look like it has been linked in properly and I expect TclCompileScript should actually be in tcl8.4.so itself, and not require anything else to be linked in at run-time...
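
One quick thing to rule out from ldd output like the above is a library the loader cannot resolve at all, which ldd reports as `=> not found`. A small sketch (the function name `missing_libs` is invented for illustration) that filters ldd output for such lines:

```shell
# missing_libs: read `ldd` output on stdin and print the name of every
# library reported as "not found". Prints nothing if all were resolved.
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# Example use (binary path taken from the post above):
#   ldd /usr/local/progs/aolserver/bin/nsd | missing_libs
```

An empty result, as the listing above would give, suggests every library was located, so the relocation error more likely concerns what is inside a library than where it lives.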

Maybe TclCompileScript has been left out of the library, though that seems very very unlikely to me or tcl would have failed many tests. I may try to find that symbol in tcl8.4.so to check on that though. Maybe its something to do with the "@tclByteCodeType" bit at the end.

argh.

I hope tcl8.4.1 works, because all i want to do is run AOLserver, not fix tcl8.4.2

John

sorry if it was a bit unclear, but the reason i posted that bit re gdk_imlib etc is because the linking problems with that library were resulting in a relocation error which at first looked similar to mine:

--------------------------------------------------
On Fri, Jul 26, 2002 at 07:18:49AM -0600, Don Allingham wrote:
I have been fighting the problem for quite a while, and I cannot come up
with a workable solution. Under Mandrake 8.2, I keep getting the
following error:

/usr/bin/python: relocation error: /usr/lib/libimlib-jpeg.so: undefined symbol: _gdk_malloc_image

Things work fine under RedHat, SuSE, debian, any other distribution that
I or anyone else has tested. I can get around the problem by using
LD_PRELOAD, but this isn't a really good solution, as it is not very
portable.

export LD_PRELOAD='/usr/X11R6/lib/libX11.so /usr/lib/libgdk_imlib.so.1 /usr/lib/libgdk.so'

I've tried forcing a load of GdkImlib, but this doesn't seem to have an
effect. The first call to any image handling routine, either from python
or from libglade, causes python to abort.
-------------------------------------------------

unfortunately in my case libtcl8.4.so should not need to link to anything else to get TclCompileScript anyway, all it should need is:
: jss: 22:41:30 /usr/local/src/gui/tcl/tcl8.4.2/unix ; ldd /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so
    libdl.so.2 => /lib/libdl.so.2 (0x400b4000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x400b7000)
    libm.so.6 => /lib/libm.so.6 (0x400ce000)
    libc.so.6 => /lib/libc.so.6 (0x400f0000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x80000000)

So it is still beyond me why TclCompileScript (ie TclCompileScript@tclByteCodeType) cannot be found in libtcl8.4.so itself. the LD_PRELOAD won't work in my case because there is nothing i can preload to define TclCompileScript before it is needed in libtcl8.4.so, because it should be IN libtcl8.4.so !!

what does "ldd nsd" say? also

nm libtcl8.4.so  | egrep TclCompileScript\|tclByteCodeType

The ns_uuencode patch in the OpenACS distribution is indeed broken, and I think it could cause the crash you are seeing.  I just submitted patch #698741 at SourceForge four days ago so it definitely isn't in any version of AOLserver.  You shouldn't need to apply it though, as OpenACS works around the issue in Tcl if the patch is not applied.  (As opposed to an applied but broken patch.)

If you really despise Tcl, don't like patching things, or want a simple out-of-the-box solution, then the OpenACS toolkit probably isn't a good match for you.  It includes over 170,000 lines of Tcl alone and has all the complexity that implies.

ldd /usr/local/progs/aolserver/bin/nsd
    /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so => /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so (0x40016000)
    libnsd.so => /usr/local/progs/aolserver-cvs/lib/libnsd.so (0x400b8000)
    libnsthread.so => /usr/local/progs/aolserver-cvs/lib/libnsthread.so (0x40105000)
    libdl.so.2 => /lib/libdl.so.2 (0x4011f000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x40122000)
    libm.so.6 => /lib/libm.so.6 (0x40138000)
    libc.so.6 => /lib/libc.so.6 (0x4015a000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

nm /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so | egrep TclCompileScript\|tclByteCodeType
0003ab40 T TclCompileScript
00099778 D tclByteCodeType

now i am really confused ... the damn thing is actually there...
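
The nm check above can be wrapped so that it gives a yes/no answer. This is only a sketch, and `symbol_defined` is a hypothetical name: it reads nm output on stdin and succeeds only if the named symbol is listed with an address, i.e. defined in that file rather than merely referenced by it.

```shell
# symbol_defined NAME: read `nm` output on stdin; exit 0 if NAME is
# defined there, 1 otherwise.
symbol_defined() {
    awk -v sym="$1" '
        # Defined symbols have three fields: address, type letter, name.
        # Undefined references have no address and so only two fields.
        NF == 3 && $3 == sym && $2 != "U" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example use (library path taken from the post above):
#   nm /usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so | symbol_defined TclCompileScript \
#       && echo "defined" || echo "missing"
```

Note that this inspects the file on disk. It cannot tell whether that file is a consistent build, which is a separate question from whether the symbol name appears in it.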

what about "strace nsd 2>&1 | grep libtcl"?

re the ns_uuencode patch, i didnt apply it because it was unclear to me exactly how to do so.

same for the -g flag.

It's not that i mind patching things, or want an out of the box solution, but i DO want something that will basically WORK out of the box, even if I have to customise it later to do what i want. ie it would be nice if the aolserver would actually run so i could then have a working system to work with!

With the amount of free software out there, it is a huge deterrent if something does not at least compile and install and work in some basic way out of the box... because if it doesnt, how do i know whether it is worth wasting more time on, or whether i should try something else?

if I had a working aolserver i would at least be able to experiment a bit with the openACS packages and see if i can tolerate the tcl api. but thats impossible if the aolserver doesnt work out of the box

i cut back my LD_LIBRARY_PATH to nothing to reduce the paths searched (as i only need LD... set for the visualisation toolkit) and the result was:

: root: 00:32:36 /root ; strace /usr/local/progs/aolserver/bin/nsd -t /home/nobody/web/vorpal/vorpal.tcl -u nobody -g web 2>&1 | grep -i libtcl
open("/usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so", O_RDONLY) = 3
open("/usr/local/progs/aolserver-cvs/lib/libtcl8.4.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/local/tcl/tcl8.4.2/lib/i686/mmx/libtcl8.4.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/local/tcl/tcl8.4.2/lib/i686/libtcl8.4.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/local/tcl/tcl8.4.2/lib/mmx/libtcl8.4.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/local/tcl/tcl8.4.2/lib/libtcl8.4.so", O_RDONLY) = 3
: root: 00:33:36 /root ;

so it does find it in the end!
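
A long strace listing like the one above can be boiled down to just the files that were opened successfully. The following sketch assumes strace lines in the `open("path", flags) = result` form shown in this post; `opened_paths` is an invented helper name.

```shell
# opened_paths: read strace output on stdin and print, once each, the
# paths of open()/openat() calls that did not return -1.
opened_paths() {
    grep -E '^open(at)?\(' |
        grep -v ' = -1 ' |
        sed -E 's/^[^"]*"([^"]*)".*/\1/' |
        sort -u
}

# Example use (command line taken from the post above):
#   strace /usr/local/progs/aolserver/bin/nsd -t /home/nobody/web/vorpal/vorpal.tcl \
#       -u nobody -g web 2>&1 | grep -i libtcl | opened_paths
```

If more than one distinct libtcl path came out of this, a stale copy shadowing the intended one would be worth suspecting. A single path, as here, points back at the contents of that one file.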

i just read something in the unix/README for tcl8.4.1 which has jogged my memory....

when i compiled my tcl8.4.2 i accidentally inserted a space after my --prefix= ie before the path that i wanted to install to (doh!). So it installed to / and i had to remove it manually. In the readme it says i should do a make distclean if i change any parameters to configure... and i did not do so before i recompiled with the right --prefix

so maybe that is the problem. Sorry to all if that does turn out to be the problem. After all my bitching it would of course be my own fault wouldnt it :)

well,

my humble apologies to all those whose time has been wasted trying to help me overcome the consequences of my own stupidity.

I recompiled tcl8.4.2 after doing a make distclean to clean up the mess caused by running configure for tcl8.4.2 with:
--prefix= /there/should/be/no/space/b4/this/path

and then configuring again with:
--prefix=/much/better/with/no/space/b4/this/path

It still amazes me that this seems to have been the cause of all my problems... i mean, really, is it too much to expect that calling configure a 2nd time with new options should overwrite all products of previous configures with the new results, without having to do a make distclean???
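
For anyone else bitten by this, the reason the stray space matters is that the shell splits `--prefix= /path` into two arguments, so configure sees an empty prefix. The sketch below demonstrates the split and wraps the rebuild in a guard. The names `count_args`, `valid_prefix` and `rebuild` are hypothetical, and `--enable-threads` is assumed from the earlier mention of building with threads enabled.

```shell
# count_args: print how many separate arguments the shell handed over.
count_args() { echo "$#"; }

# valid_prefix PATH: succeed only for a non-empty absolute path.
valid_prefix() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

count_args --prefix=/usr/local/tcl    # prints 1
count_args --prefix= /usr/local/tcl   # prints 2: "--prefix=" plus a stray path

# rebuild PREFIX: refuse a bad prefix, wipe earlier configure results,
# then configure and build. Meant to be run from the Tcl unix/ directory.
rebuild() {
    valid_prefix "$1" || { echo "bad prefix: '$1'" >&2; return 1; }
    make distclean 2>/dev/null || true    # fine to fail on a fresh tree
    ./configure --prefix="$1" --enable-threads && make && make install
}
```

Running `make distclean` before every reconfigure is the step the Tcl unix/README asks for, as noted above; the prefix check is just an extra safety net.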

oh well.

I also tried compiling AOLserver4 CVS against tcl8.4.1 and yeah, no problems there either.

A final comment is in order though. I still would not have a working AOLserver were it not for the fact that I was able to scavenge a copy of nsrewrite.so from the build I did of Matt's AOLserver distribution. My CVS checkout of AOLserver4 did not come with any code for nsrewrite.so, and the nsrewrite module is NOT available at sourceforge with all the other modules for AOLserver.

Besides that the only very minor problem was that the sample openACS config file at:

https://openacs.org/doc/openacs-4/files/openacs4.tcl.txt

does not mention that openACS also needs AOLserver to load the nsdb module -- so it needs to have the line:

ns_param  nsdb            ${bindir}/nsdb.so

added at the appropriate place.
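
A quick way to check whether a given config file already has that line is sketched below. `loads_module` is a made-up helper; it only looks for an uncommented `ns_param` line with the given module name and does not verify which ns_section the line sits in.

```shell
# loads_module CONFIG NAME: succeed if CONFIG contains an uncommented
# "ns_param NAME ..." line.
loads_module() {
    grep -Eq "^[[:space:]]*ns_param[[:space:]]+$2[[:space:]]" "$1"
}

# Example use (config path taken from the first post):
#   loads_module /home/nobody/web/vorpal/vorpal.tcl nsdb || echo "nsdb line missing"
```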

Sorry again to all those whose time was wasted especially those who took the time to try to help me.

john

The OpenACS config file hasn't been updated for AOLserver 4.0 as almost everyone here uses the OpenACS distribution of AOLserver, which forked from the AOL version a long time ago.  Merging the distributions is a recent effort, coinciding with work on the 4.0 branch which includes several other improvements.

nsrewrite isn't actually used by OpenACS so you can remove it from your config file if you want.  The module itself is in transition from OpenACS to SourceForge.  nsdb as a separate module is an AOLserver 4-ism.

I'm glad you found a solution to your problem!