We schedule Watchdog to run every 5 minutes. However, last Saturday it didn't run at all during this three-hour period:
[15/Nov/2008:13:35:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:13:40:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:14:22:07][14058.1094724224][-conn:outpost-prod::23] Error: nsoracle.c:2994:Ns_OracleOpenDb: error in `OCISessionBegin ()': ORA-00257: archiver error. Connect internal only, until freed.
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Errors found.
[15/Nov/2008:16:48:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
The reason is that Oracle error in the middle there. Our database broke, and was returning errors to all queries. We fixed that, and then Watchdog quickly emailed us about all the errors in our AOLserver log, too late to do any good.
So, I think Watchdog currently depends on the database being available, and silently fails if it's not.
The little wd_email_frequency helper proc (which I may have written)
looks suspicious, as its call to ad_parameter can implicitly do a
database query. I bet there are other places in Watchdog where a
broken database will stop all error reporting.
What do you think is the best approach to fixing this?
Calls like ad_parameter make possible database
dependencies harder for me to understand. ad_parameter
and friends (parameter::get, ad_parameter_cache,
etc.) clearly support caching of fetched values in some fashion, but
I don't know if or when it is ever safe to conclude that the cache will
definitely be populated, and that it's thus safe to use these sorts of
calls from Watchdog.
My instinct is to eliminate all database use from
Watchdog, and fetch any necessary settings (like
WatchDogFrequency) solely from the AOLserver config file
via ns_config.
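Here is a minimal sketch of what I have in mind, not tested against a real install. The proc names wd_config_get and wd_email_frequency_from_config, the "watchdog" config section, and the fallback value of 5 are all placeholders I made up for illustration; only ns_config, ns_info and the WatchDogFrequency name are real.

```tcl
# Hypothetical helper: read a Watchdog setting from the AOLserver config
# file only, so that no database query can ever be triggered.
proc wd_config_get {key default} {
    # Assumed section name; change it to wherever the settings really live.
    set section "ns/server/[ns_info server]/watchdog"
    set value [ns_config $section $key]
    # ns_config returns an empty string when the key is not set.
    if {[string equal $value ""]} {
        return $default
    }
    return $value
}

# Hypothetical database-free replacement for wd_email_frequency.
# The fallback of 5 is a placeholder, not the package's real default.
proc wd_email_frequency_from_config {} {
    return [wd_config_get WatchDogFrequency 5]
}
```

The matching config file entry would be an ns_section for that same section name with an ns_param line for WatchDogFrequency. The obvious cost is that the setting could no longer be changed through the package parameter pages.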
To test any of that, I'd want to selectively break database access in
the Watchdog thread, probably by redefining some key DB API call to
fail (or log warnings). That would be useful for both tracking down
database dependencies in the first place, and eventually verifying
that Watchdog works even if the database is broken. What would be the
best place to add that instrumentation: in db_exec, or in ns_db
itself?
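For the ns_db variant, a rough sketch of the kind of wrapper I mean follows. The names wd_break_db, wd_restore_db and wd_real_ns_db are hypothetical. It assumes that renaming ns_db inside the interpreter running Watchdog does not affect other threads' interpreters, which I have not verified.

```tcl
# Hypothetical debugging aid: make every ns_db call in this interpreter
# log a warning (with the calling procs) and then fail.
proc wd_break_db {} {
    # Do nothing if ns_db is absent or has already been wrapped.
    if {[llength [info commands ns_db]] == 0
        || [llength [info commands wd_real_ns_db]] > 0} {
        return
    }
    rename ns_db wd_real_ns_db
    proc ns_db {args} {
        # Collect the names of the procs on the call stack above us.
        set stack {}
        for {set i 1} {$i < [info level]} {incr i} {
            lappend stack [lindex [info level $i] 0]
        }
        ns_log Warning "Watchdog DB dependency: ns_db $args (called via: $stack)"
        error "ns_db disabled for Watchdog testing"
    }
}

# Undo wd_break_db and put the real ns_db back.
proc wd_restore_db {} {
    if {[llength [info commands wd_real_ns_db]] > 0} {
        rename ns_db {}
        rename wd_real_ns_db ns_db
    }
}
```

The idea would be to call wd_break_db at the start of the scheduled Watchdog proc and wd_restore_db at the end, then read the warnings in the server log to see which calls reach the database.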