Forum OpenACS Q&A: Openacs process disappearing

Collapse
Posted by Jason Cheung on
Currently, there is a problem that is occuring periodically with one of our services. Below is a snapshot of the log right just before the process "disappears" and whenever we restart openacs, after 10 mins or so the same occurance happens.

Are there any suggestions as to how one should debug this? we have another service which uses the same code and config but rarely has this problem?

Regards,

Jason

[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0]
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0]
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= sec_handler: enter
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS: Not a valid session cookie
, looking for login cookie
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= sec_login_handler: enter
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= sec_setup_session: enter
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= empty session_id
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= newly allocated session 59
523217
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= about to generate session
id cookie
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: Security: 1132125777 sec_generat
e_session_id_cookie setting session_id=59523217, user_id=0, login_level=0
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: Security: Getting token_id 39, v
alue E9550B9D21DF6C1BD6F2FF369E82D09544694CCB
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Debug: OACS= done generating session id
cookie
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Notice: Querying '
select 1
where 't' = acs_permission__permission_p('1208', '0', 'read');'
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Notice: dbinit: sql(localhost:5431:serv
ice0): '
select 1
where 't' = acs_permission__permission_p('1208', '0', 'read')
'
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Notice: sec_lookup_property starting --
jason
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Notice: Querying '

select property_value, secure_p
from sec_session_properties
where session_id = '59523217'
and module = 'acs-lang'
and property_name = 'translator_mode_p';'
[16/Nov/2005:15:22:57][11973.4123126704][-conn:service0::0] Notice: dbinit: sql(localhost:5431:serv
ice0): '

select property_value, secure_p
from sec_session_properties
where session_id = '59523217'
and module = 'acs-lang'
and property_name = 'translator_mode_p'

##script that checks existence of process and restarts if not found

[16/Nov/2005:15:24:11][12864.4142909664][-main-] Notice: nsmain: AOLserver/4.0.9 starting

Collapse
Posted by Koala Yeung on
I have encountered similar problem. It disappered when I block all the IPs of Yahoo! Slurp, Googlebot and msnbot. But that is RIDICULOUS for a public website!

Does any body have any other solution?

Collapse
Posted by Andrew Piskorski on
Jason, I'm afraid you need to be a little - ok a lot - more specific. What do you mean by saying the process "disappears"? What are the actual symptoms you are observing?

Secondly, you seem to have edited the AOLserver log snippet you posted. Does your log really show nothing at all for the 74 seconds between 15:22:57 and 15:24:11, or is that just an artifact of how you posted the info?

Collapse
Posted by Jason Cheung on
When I say the process "disappears" it means it cannot be found with ps -aewff | grep service0 - just gone without a trace.

As for the log it is an exact snapshot except for the comment I added to explain how I am able to restart the service soon after. If the log did show something for those 74 second, I would be in a much better situation.

Further to Koala's findings, I have blocked Googlebot and MSNbot and the problem stops. Below is a snippet of the service log - so is there a link between robots and session_id which is killing the process?

Again, this is very specific to one particular service which is affected. The same code on another service does not have this problem ...

----- service log -----
207.46.98.61 - - [16/Nov/2005:18:25:41 +0800] "GET /public/contents/66874 HTTP/1.0" 302 378 "" "msnbot/1.0 (+http://search.msn.com/msnbot
.htm)" -
207.46.98.61 - - [16/Nov/2005:18:25:41 +0800] "GET /public/contents/news?revision%5fid=66863&item%5fid=66862 HTTP/1.0" 200 33535 "" "msnb
ot/1.0 (+http://search.msn.com/msnbot.htm)" -
202.181.201.153 - - [16/Nov/2005:18:25:42 +0800] "GET /forums/message-view?message_id=66170 HTTP/1.0" 500 27436 "" "Mozilla/5.0 (compatib
le; Googlebot/2.1; +http://www.google.com/bot.html)" -
192.168.10.31 - - [16/Nov/2005:18:25:46 +0800] "GET /SYSTEM/dbtest.tcl HTTP/1.0" 200 7 "" "Wget/1.9+cvs-stable (Red Hat modified)" -
192.168.10.31 - - [16/Nov/2005:18:26:01 +0800] "GET /SYSTEM/dbtest.tcl HTTP/1.0" 200 7 "" "Wget/1.9+cvs-stable (Red Hat modified)" -
207.166.200.211 - - [16/Nov/2005:18:26:38 +0800] "GET /public/uptime.html HTTP/1.0" 200 8 "" "AOLserver-Tcl/4.0" -
207.46.98.44 - - [16/Nov/2005:18:26:40 +0800] "GET /robots.txt HTTP/1.0" 200 388 "" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)" -
207.46.98.44 - - [16/Nov/2005:18:26:41 +0800] "GET /public/contents/article?revision%5fid=65132&item%5fid=65131 HTTP/1.0" 200 36558 "" "m
snbot/1.0 (+http://search.msn.com/msnbot.htm)" -
192.168.10.31 - - [16/Nov/2005:18:26:46 +0800] "GET /SYSTEM/dbtest.tcl HTTP/1.0" 200 7 "" "Wget/1.9+cvs-stable (Red Hat modified)" -
202.181.201.153 - - [16/Nov/2005:18:26:46 +0800] "GET /public/contents/my_domain_name/public/contents/12742 HTTP/1.0" 302 410 "
" "Googlebot/2.1 (+http://www.google.com/bot.html)" -

---

[16/Nov/2005:18:26:48][31059.4123126704][-conn:service0::0] Notice: dbinit: sql(localhost:5431:service0): '

select property_value, secure_p
from sec_session_properties
where session_id = '59873219'
and module = 'acs-lang'
and property_name = 'translator_mode_p'

Collapse
Posted by Torben Brosten on
What operating system?

Does aolserver post any warnings in the logs during startup?

Collapse
Posted by Gustaf Neumann on
Technically speaking, there is no openacs process, just the aolserver (nsd). There were a couple of fixes lately in tcl/aolserver, so i would recommed to upgrade to aolserver 4.0.10 (avail since jan) and to tcl8.4.11.

what os are you using?
Under usual conditions you get some hints from the error-log file. Anything special about your setup?

what is the contents of /public/contents/my_domain_name/public/contents/

what is the size of you nsd-process, how many connection threads are defined (maxthread)?

Collapse
Posted by Andrew Piskorski on
Jason, FYI, to get the best help here on the Forums, you really should give us a lot more info first thing, e.g.: What operating system and version? What version of AOLserver? (Apparently 4.0.9 from your log.) What AOLserver modules do you have loaded? (In particular, do you have any nonstandard modules?) What version of Tcl? How did you build and install AOLserver and Tcl? What threads and stacksize settings are you using in your AOLserver config file (in the ns/parameters and ns/server/$servername sections.)?

In general there are only three ways AOLserver can be killed: It itself chooses to shut down by Tcl code calling ns_shutdown or the like, it crashes, or some other process (the kernel, the user, whatever) purposely kills it.

Your AOLserver log has no messages at all from AOLserver about shutting down, it was just running normally, and then bang, a minute or so later it's starting up again. That means that AOLserver either crashed or was kill -9'd (SIGKILL). If it had been killed with a plain kill (SIGTERM) or ns_shutdown, you'd see evidence of AOLserver doing an orderly shutdown in its log - which you don't.

Do you have some form of process resource limits configured in your operating system? Memory limits or the like? I bet that's it - some web crawler hits your (otherwise low traffic) site, AOLserver grabs a bunch more memory, and then your OS shoots the process dead with kill -9. (If true, that isn't AOLserver's fault, your OS is misconfigured.) If you're on Linux, check with "ulimit -a". Ideally, run ulimit -a from the script which actually starts AOLserver.

If on the other hand AOLserver is in fact crashing, you want it to leave a core file so that you can figure out why it crashed. If you're on Linux, make sure you do something like "ulimit -c unlimited" in the script which starts AOLserver. That sets no limit on the max size of corefiles - the default is often 0, which means you won't get any corefile at all. You may want to set that as a system-wide default in "/etc/profile".

Collapse
Posted by Torben Brosten on
Regarding the ulimit, aolserver4.0.10 provides some warnings on startup if it detects misconfigurations.
Collapse
Posted by Jason Cheung on
versions
------------
Operating System: Fedora Core 3
AOLserver: version 4.0.9
tcl: version 8.4

building and installation
------------------------------
building with the instructinos given in openacs.org:
https://openacs.org/doc/current/complete-install.html
(only altered some name of folders)

configuration
-----------------
Modules Loaded: nssock nslog nssha1 nscache nsgdchart nsdb
stacksize: expr 128 * 8192

Planned kill
---------------
There is no planed kill for AOLserver process.
AOLserver process crashed itself at some point.

Clawer
--------
Yes. Clawer was the last visiter before crashes most of the time. And most of the time it is Googlebot.

resource limit
-----------------
I have not set any limit myself, but the os may set some
limit by default. Please see what I got from the OS:

[root@machine etc]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 16380
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I am trying your ulimit -c unlimited option now.
See if it is related. The site was down when an unblocked Googlebot visited the CLB site.
I have block that ip now

Collapse
Posted by Andrew Piskorski on
Jason, "ulimit -a" reports the current limits in effect for that shell, right now. You gave ulimit output for a root shell, which may or may not have any relevance at all. You need to know what if any limits are configured for the shell that starts the AOLserver process. Also, AFAICT you do not yet know whether or not "AOLserver crashed itself" - that one of the things you're trying to figure out.
Collapse
Posted by Jason Cheung on
ulimit
---------
Thanks for reminding. I didn't know that. I've set the OpenACS user (serice0, service1) ulimit now. Thanks

AOLserver crash
------------------
I think I am quite sure that it is crashed itself. Because:

1) I haven't shut it down.
2) I haven't schedule any shutdown at this frequency (10 or more in an hour sometimes)
3) There is no log for shutdown in the log file

Unless there is other evidence I can look for, I suppose the AOLserver process crashes itself.

Collapse
Posted by Andrew Piskorski on
Well, were the ulimit settings wrong, or not? Your assumption that AOLserver is crashing is still unfounded, for all the same reasons I explained above. If you fix the 'ulimit -c' setting and then see AOLserver leaving a corefile, then you would actually know that it is in fact crashing. And more importantly, you could run gdb on it to see where and perhaps even why it's crashing. Didn't you try that already, above on Nov. 27? So, do you see a corefile now, or not?
Collapse
Posted by Gustaf Neumann on
upgrade to aolserver 4.0.10 (avail since jan) and to tcl8.4.11.

Be sure, that you do not call "ns_server threads"
(see http://panoptic.com/wiki/aolserver/Ns_server)

Collapse
Posted by Jason Cheung on
I have narrowed the problem to a problem in configuration or coding of my website. It seems the request processor is not handling server errors properly from invalid URLs.

72.30.61.43 - - [27/Jun/2006:02:07:50 +0800] "GET /robots.txt HTTP/1.1" 200 303
"" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch
/slurp)" -

[27/Jun/2006:02:07:51][25599.3024083888][-conn:service0::49] Notice: MONITORING WEBSITE MOVEMENT <<<<<
<<<<<<<<<< iso-8859-1 /public/contents/category?cid=1017%3C/p%3E%0D%0A%3Cp%3E >>>>>>>>>>>>>>>>>>>

What has been entered is a bad id being entered from the clients browser and it is not redirecting the 500 to a file-not-found.html in /global.

Furthermore, if the url as a space after the file name followed by the above query - it results in server error and process is gone.

Any advice is appreciated.

Collapse
Posted by Jason Cheung on
Whats odd is that /public/main is a valid url (there is a main.adp,tcl) and typing /public/maintestingerror redirects fine back to the home page - BUT if I try something like /public/contents/a then it crashes.
Collapse
Posted by Jason Cheung on
It seems that the integer validation of cid is not working .. is there anything wrong with this snippet of code?

ad_page_contract {

A gateway, base one Category_id, redirect user to specific page

} {
cid:integer,notnull
{lang ""}
}

Collapse
Posted by Gustaf Neumann on
Jason, what version of openacs are you using? what version of tcl?

try the following in the shell:
set var x
set val 123
ad_page_contract_filter_proc_integer var val

(this is a valid test and returns 1) and then

set var x
set val [ns_urldecode 1017%3C/p%3E%0D%0A%3Cp%3E]
ad_page_contract_filter_proc_integer var val

(this is the test with the invalid id from your example)
If the integer test is the problem, the second one crashes.
the second call should return 0.

for me, this works nicely...

Collapse
Posted by Jason Cheung on
Hi Gustaf,

Thank you for your reply,

openacs 5.1.2d
tcl: version 8.4

The two tests did not return any error - and to update on further investigation, another website (lets say website B) using the same tcl version and openacs version did not have the same problem with entering a string in the query. I.E. it would complain:

We had a problem with your input:

group_id is not an integer

Please back up using your browser, correct the above , and resubmit your entry.

Thank you.

Whereas in website A (if I enter category?cid=a),it returns 500 Internal Server Error. Tested string and email also; they work fine.

Not sure if it relates to the openacs core installation or the custom tcl code or aolserver configuration.

Jason, do the tests return on both systems the correct boolean value? To be sure, try to use the function ad_page_contract_filter integer from version 1.20 http://eye.openacs.org/viewrep/OpenACS/packages/acs-tcl/tcl/tcl-documentation-procs.tcl

don't you see any error message in the error log from the 500 error?

Gustaf,

After some more testing it seems I know where the problem lies. It seems the master template (blank-master.adp) contained

ad_page_contract {

} {

{item_id:integer 0}
{cid:integer 0}
{return_url ""}

}
which it did not agree with .. guess it did do the validation after all and it was the display template which had the 500 problem!

Jason

I have 2 files - test.tcl and blank-master.tcl

test.tcl

ad_page_contract {

} {
x:notnull,integer
}

ns_log Notice INSIDETEST
ad_return_template

blank-master.tcl

ad_page_contract {

} {
{x ""}
}

ns_log Notice BLANKMASTER

How might I pass value of x into blank-master in usual circumstances such as test?x=1 and test?x=a (this works;returns error) and also catch invalid urls such as test?x=1017%3C/p%3E%0D%0A%3Cp%3E (this causes server error).

A quick solution is to pass as a property in default-master and do integer validation further down in blank-master.

However, is there a coding convention that I am missing for passing queries to blank-master with better exception handling ?

The problem that I have had was rp processor displaying the error message within the master template which caught the same validation error thus resulting in a neverending error loop.