Forum OpenACS Q&A: OpenACS keepalive: how to do it for a Windows-based installation

Hi,

I am using project-open 3.5, which bundles an OpenACS server. We are facing a strange issue: after 15 days or so the ACS server stops responding, even though we can still ping the PC.
If we try to stop the service, it says "stopping" but does not stop. We need to reboot the PC to bring it back to a normal state.

I am looking for a keepalive script that can run on Windows. As I could not find inittab in project-open, is there any other method or script to check whether the server is alive and restart it if it is not responding?

I have asked the same question in the project-open forum, but nobody replied, so I thought I should post in this forum.

http://sourceforge.net/projects/project-open/forums/forum/295937/topic/5199212

Thanks for any help.

Regards,
Sujata

Hi Sujata

what does the AOLserver error log report before the service stops? Are you using a Windows service to start and stop AOLserver?

best wishes
Brian

Thanks for the reply.

I am trying to get the logs of the server. I will post the relevant contents once I get them.

Yes, I am using a Windows service to start and stop the AOLserver.

OK, I got the logs. I see the messages below repeated a number of times every day in the error.log file, but we don't have any problem accessing the server.

However, when the server stopped responding, this was the last message. All other messages are of type "Notice", so I am not posting them.

Please let me know if you can make something out of it. It generally happens on weekends, and then we cannot access project-open on Mondays.

==========================================================

[24/Mar/2012:05:31:46][1244.3180][-sched:23-] Notice: im_mail_import.process_mails0: Error creating '/web/projop/Maildir/spam' folder: 'mkdir ("/web/projop/Maildir/spam") failed: no such file or directory'
[24/Mar/2012:05:31:54][1244.3136][-sched:16-] Notice: acs-mail-lite: about to load qmail queue
[24/Mar/2012:05:31:54][1244.3136][-sched:16-] Notice: acs_mail_lite::load_mail_dir: queue_dir=''
[24/Mar/2012:05:31:54][1244.3136][-sched:16-] Notice: acs_mail_lite::load_mail_dir: queue dir = /new/*, no messages
[24/Mar/2012:05:32:16][1244.1684][-sched:25-] Notice: sync: uid=32644, pid=50898, day=2012-03-20 00:00:00+05:30
[24/Mar/2012:05:32:16][1244.1684][-sched:25-] Error: Tcl exception:
ambiguous option "file": must be authpassword, authuser, channel, close, content, contentlength, contentsentlength, contentchannel, copy, driver, encoding, files, fileoffset, filelength, fileheaders, flags, form, headers, host, id, isconnected, location, method, outputheaders, peeraddr, peerport, port, protocol, query, request, server, sock, start, status, url, urlc, urlencoding, urlv, version, or write_encoded
while executing
"ns_conn $var"
(procedure "ad_conn" line 90)
invoked from within
"ad_conn file"
(procedure "ad_parse_template" line 15)
invoked from within
"ad_parse_template -params [list [list exception_count $exception_count] [list exception_text $exception_text]] "/packages/acs-tcl/lib/ad-return-com..."
(procedure "ad_return_complaint" line 2)
invoked from within
"ad_return_complaint 1 "

  • [_ intranet-core.lt_Unable_to_determine_I]

    [_ intranet-core.lt_Maybe_somebody_has_ch]""
    (procedure "im_company_internal_helper" line 5)
    invoked from within
    "im_company_internal_helper"
    ("eval" body line 1)
    invoked from within
    "eval $script"
    invoked from within
    "ns_cache eval util_memoize $script {
    list $current_time [eval $script]
    }"
    (procedure "util_memoize" line 20)
    invoked from within
    "util_memoize [list im_company_internal_helper]"
    (procedure "im_company_internal" line 2)
    invoked from within
    "im_company_internal"
    (procedure "im_cost::new" line 3)
    invoked from within
    "im_cost::new -cost_name $cost_name -user_id $hour_user_id -creation_ip "0.0.0.0" -cost_type_id [im_cost_type_timesheet]"
    ("uplevel" body line 5)
    invoked from within
    "uplevel 1 $code_block "
    ("uplevel" body line 1)
    invoked from within
    "uplevel 1 $code_block "
    invoked from within
    "db_with_handle -dbn $dbn db {
    set selection [db_exec select $db $full_statement_name $sql]

    set counter 0
    while { [db_getrow $..."
    (procedure "db_foreach" line 36)
    invoked from within
    "db_foreach hours $sql {

    ns_log Notice "sync: uid=$hour_user_id, pid=$project_id, day=$day"
    set cost_name "Timesheet $hour_date $project_nr $user_na..."
    (procedure "im_timesheet2_sync_timesheet_costs" line 48)
    invoked from within
    "im_timesheet2_sync_timesheet_costs"
    ("eval" body line 1)
    invoked from within
    "eval [concat [list $proc] $args]"
    (procedure "ad_run_scheduled_proc" line 42)
    invoked from within
    "ad_run_scheduled_proc {t f 61 im_timesheet2_sync_timesheet_costs {} 1331615237 0 f}"
    [26/Mar/2012:10:10:34][1244.1376][-thread1376-] Notice: nsmain: AOLserver/4.5.1 stopping
    [26/Mar/2012:10:10:34][1244.1376][-thread1376-] Notice: driver: stopping: nssock
    [26/Mar/2012:10:12:50][208.336][-thread336-] Notice: nsmain: AOLserver/4.5.1 starting
    ========================================================

    Thanks,
    Sujata

  • Hi

    First of all, if the service isn't stopping, you can just kill it. I don't think it's necessary to reboot the server.

    In terms of what's causing your issue, it's not too clear. Did somebody manually stop the server at 26/Mar/2012:10:10:34? I find it hard to believe that the Tcl exception at 24/Mar/2012:05:32:16 actually caused the server to freeze up.

    One thing you could try before rebooting is to do a telnet to port 80 (as described here http://philip.greenspun.com/seia/basics ) just to see if the server is really down.
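The telnet test can also be scripted. Below is a minimal sketch in Python of the same check, i.e. whether anything accepts a TCP connection on the server port. The function name `port_is_open` is made up for this illustration, and the host and port are assumptions to adjust for your installation. A successful connect only shows that the port accepts connections; it does not prove that the server actually answers requests.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake,
        # which is roughly what "telnet host port" does before you type anything.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

For example, `port_is_open("localhost", 8000)` would test port 8000 on the local machine.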

    The Tcl exception appears to be caused by a call to ad_conn from within a scheduled proc. This is a bug; see https://openacs.org/forums/message-view?message_id=162503 for an example. The scheduled proc needs to take into account that ad_conn is not available.

    Hope this helps
    Brian

    Thanks for your reply.

    Yes. As the server was not responding (or so I thought, because the project-open web page was not accessible), we manually tried to stop the service on 26 March. As it was taking too long to stop, we just rebooted the machine.

    Next time it happens, I will try telnet and see. We observed the issue on 24 March and again on 13 April, so we have to wait a few weeks for it to reappear; the most probable dates are either 28-29 April or 5-6 May.

    I will update my observation and let you know what actually happened.

    I tried to search for "how to kill the AOL-server in Win XP" but could not get satisfactory answers. Below is what I found:

    1) Go to Task Manager -> select "nsd.exe" and end the process tree

    2) Use the command "netstat -ano", get the PID for port 8000, which is the actual project-open port, and give the command "taskkill /f /PID 5072"

    With both methods, Services then shows AOLserver-projop as "starting", but it is not doing anything, and I have to stop it manually again. If I stop it immediately, it does stop and I can restart it.

    I am new to this, and I would really appreciate it if you could confirm which way should be used to kill the server, or whether I am doing it totally the wrong way.

    Thanks,
    Sujata

    For a forced kill, either of those methods should be fine. This is obviously in the situation where the normal shutdown isn't responding. You should be keeping an eye on the error logs for the "Notice: nsmain: AOLserver/4.5.1 stopping" message. Sometimes it can just take a long time to stop.
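The netstat/taskkill method can be scripted as well. Here is a minimal sketch in Python, assuming the English column layout of "netstat -ano" (Proto, Local Address, Foreign Address, State, PID). The function names are made up for this illustration and the sample rows in the comment are hypothetical, apart from port 8000 and PID 5072 quoted above.

```python
import subprocess


def find_listening_pids(netstat_output, port):
    """Return the sorted PIDs that `netstat -ano` output shows as LISTENING on a TCP port."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A TCP row is expected to look like (hypothetical example):
        #   TCP    0.0.0.0:8000    0.0.0.0:0    LISTENING    5072
        if len(fields) != 5 or fields[0].upper() != "TCP":
            continue
        local_address, state, pid = fields[1], fields[3], fields[4]
        if state.upper() != "LISTENING" or not pid.isdigit():
            continue
        # Take everything after the last ':' so IPv6 addresses like [::]:8000 also work.
        if local_address.rpartition(":")[2] == str(port):
            pids.add(int(pid))
    return sorted(pids)


def taskkill_command(pid):
    """Build the forced-kill command for one PID (same as typing it at the prompt)."""
    return ["taskkill", "/F", "/PID", str(pid)]


def kill_listeners(port):
    """Run netstat, then force-kill every process listening on `port` (Windows only)."""
    output = subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout
    for pid in find_listening_pids(output, port):
        subprocess.run(taskkill_command(pid), check=False)
```

Calling `kill_listeners(8000)` would then do the same as method 2, so it should only be used when the normal shutdown really is not responding.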

    There was a lengthy discussion recently on the AOLserver mailing list about these occasional problems with Windows versions of AOLserver shutting down. I'm not sure what the final resolution was. You can see some of the threads here. http://permalink.gmane.org/gmane.comp.web.aolserver/16523

    Brian

    What I can see from the log snippets are two things, most likely unrelated:

    - The first error is triggered from your scheduled procedure 25. You can figure out what this is either from nsstats.tcl or, if you do not have nsstats installed, from ds/shell (or nscp) by running "lindex [ns_info scheduled] 25" there. Normally, "ad_return_complaint 1" should be followed by an error message. You should first figure out what your sched proc 25 is before digging into this, so that you find the right spot more easily. However, while this error is not intended behavior, it is most likely not related to the second error, which happened nearly 5 hours later.

    - The second error happened in "im_company_internal_helper" in the package intranet-core, which is most likely a ]PO[ package, doing some timesheet calculation. Ask the PO people why this fails. The snippet does not show when exactly this error happened. We see from the snippet that the server is stopping at 10:10:34. This is a strange time for a scheduled restart; most likely, someone stopped the server manually. The server was restarted 2 min 16 secs later, also most likely manually. The problem, it seems to me, is that a second server instance was started before the first server instance had exited.

    When the server stops, it shuts down the network connections (as shown in the log) and shuts down all other services such as scheduled procedures (it tries to finish these). Then you should see a message like "Notice: sched: shutdown complete", ... "Notice: driver: stopped: nssock", and finally "Notice: nsmain: AOLserver/4.5.1 exiting". The exact detail of the messages depend on the configuration and logging detail.
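A small script can check the error log for this. The following is a minimal sketch in Python that reports every "stopping" message which is not followed by an "exiting" message before the next "starting" message (or the end of the log). It assumes the nsmain message texts quoted in this thread; the function name is made up for this illustration, and, as said above, the exact messages depend on configuration and logging detail.

```python
import re

# Matches e.g. "Notice: nsmain: AOLserver/4.5.1 stopping" at the end of a log line.
EVENT = re.compile(r"Notice: nsmain: AOLserver/\S+ (starting|stopping|exiting)\s*$")


def unfinished_shutdowns(log_lines):
    """Return the 'stopping' lines that never reached 'exiting' before the next start."""
    unfinished = []
    pending = None  # the most recent 'stopping' line still waiting for 'exiting'
    for line in log_lines:
        match = EVENT.search(line)
        if not match:
            continue
        event = match.group(1)
        if event == "exiting":
            pending = None
            continue
        # A new 'stopping' or 'starting' while a shutdown is still pending means
        # the earlier shutdown was never reported as complete.
        if pending is not None:
            unfinished.append(pending)
            pending = None
        if event == "stopping":
            pending = line.strip()
    if pending is not None:
        unfinished.append(pending)
    return unfinished
```

It could be run on the lines of error.log, for example with `unfinished_shutdowns(open(path_to_error_log))`, where the path is whatever your installation uses.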

    As we can see from the log, at least the exit message is missing. It looks to me as if you have more or less two servers running: the first one hanging in shutdown state (but having stopped accepting requests) and the second one waiting for the resources of the first, not being able to receive requests.
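For a keepalive script, this suggests making sure the hung instance is gone before the service is started again. Below is a minimal sketch in Python of that ordering. It assumes the service name AOLserver-projop shown in Services earlier in this thread and that the PIDs of the old nsd.exe have been determined separately; the function names are made up for this illustration. "net stop" may simply report an error if the service is already stopped, which the sketch ignores.

```python
import subprocess


def plan_recovery(responding, old_pids, service="AOLserver-projop"):
    """Return the commands to run, in order, when the server does not respond."""
    if responding:
        return []  # nothing to do while the server answers
    # 1. Force-kill any old instance first, so it cannot block the new one.
    commands = [["taskkill", "/F", "/PID", str(pid)] for pid in old_pids]
    # 2. Bring the Windows service into a defined stopped state.
    commands.append(["net", "stop", service])
    # 3. Only then start a fresh instance.
    commands.append(["net", "start", service])
    return commands


def run_recovery(commands):
    """Run the planned commands one after another, ignoring individual failures."""
    for command in commands:
        subprocess.run(command, check=False)
```

Keeping the plan separate from its execution makes it easy to log or review what the script would do before letting it act on a production machine.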

    Since you are running under Windows, you are most likely using the version compiled by Maurizio, which already has the Tcl shutdown details deactivated that are discussed in the thread Brian mentioned. Since this version already performs "an exit without running all pending operations", the worst thing that might happen is truncated operations, but no hangs.

    Are you performing scheduled restarts? If not, you should figure out, who or what is stopping your server (it does not do this by itself) and check there the stopping/starting conditions.