Forum OpenACS Development: Developer Support Database Statistics

Collapse
Posted by Dave Bauer on
I have a habit of using the Developer Support Toolbar with database statistics turned on to track slow queries, and to get the queries with variables replaced for debugging.

This works ok for simple pages, but for complex pages with many queries, this can cause the server to slow down and I have been noticing nsv mutex locks which can apparently create a deadlock or take forever to resolve.

A small sample for a log:

Mutex lock nsv:7:hub: wait duration 4.538288
Mutex lock nsv:7:hub: wait duration 3.038703
Mutex lock nsv:7:hub: wait duration 3.207191
Mutex lock nsv:7:hub: wait duration 3.314921
Mutex lock nsv:7:hub: wait duration 3.149039
Mutex lock nsv:7:hub: wait duration 0.256383
Mutex lock nsv:7:hub: wait duration 3.145087
Mutex lock nsv:7:hub: wait duration 0.705920
Mutex lock nsv:7:hub: wait duration 2.478933
Mutex lock nsv:7:hub: wait duration 2.436423
Mutex lock nsv:7:hub: wait duration 0.116127
Mutex lock nsv:7:hub: wait duration 2.629668
Mutex lock nsv:7:hub: wait duration 2.489575
Mutex lock nsv:7:hub: wait duration 1.663972
Mutex lock nsv:7:hub: wait duration 1.337092
Mutex lock nsv:7:hub: wait duration 0.636399
Mutex lock nsv:7:hub: wait duration 2.343290
Mutex lock nsv:7:hub: wait duration 0.104463
Mutex lock nsv:7:hub: wait duration 2.289196
Mutex lock nsv:7:hub: wait duration 0.473416
Mutex lock nsv:7:hub: wait duration 2.608169
Mutex lock nsv:7:hub: wait duration 0.256684
Mutex lock nsv:7:hub: wait duration 2.790731
Mutex lock nsv:7:hub: wait duration 1.190696
Mutex lock nsv:7:hub: wait duration 2.702606
Mutex lock nsv:7:hub: wait duration 1.303886
Mutex lock nsv:7:hub: wait duration 2.697363
Mutex lock nsv:7:hub: wait duration 2.382554
Mutex lock nsv:7:hub: wait duration 1.463610
Mutex lock nsv:7:hub: wait duration 3.224199
Mutex lock nsv:7:hub: wait duration 0.821237
Mutex lock nsv:7:hub: wait duration 2.802242
Mutex lock nsv:7:hub: wait duration 1.388523
Mutex lock nsv:7:hub: wait duration 2.296387
Mutex lock nsv:7:hub: wait duration 0.823698
Mutex lock nsv:7:hub: wait duration 2.502012
Mutex lock oacs:sched_procs: wait duration 1.411537
Mutex lock ns:sched: wait duration 0.139955
Mutex lock nsv:3:hub: wait duration 0.128161
Mutex lock nsv:3:hub: wait duration 0.759700
Mutex lock nsv:3:hub: wait duration 0.743639
Mutex lock ns:sched: wait duration 0.649260
Mutex lock ns:sched: wait duration 0.383575
Mutex lock nsv:2:hub: wait duration 2.194089
Mutex lock nsv:2:hub: wait duration 0.533795

Collapse
Posted by Gustaf Neumann on
Hi Dave,

the problem are not complex pages, but the overall load form *all* queries that add tracing information to the shared nsv variables that the developer support uses. One will see such messages only from busy sites, and not when running complex queries on non-busy sites.

I've added the "wait duration" output a few years ago to naviserver to raise the awareness of developers, what's going on. It is actually a very nice feature compared to other systems, which are silently coming to a crawl in such situations. You should consider using nsstats [1] and the munin plugins [2], where nsstats provides among other things a mapping from nsv-names to mutexes, and the munin plugins are nice to show, what is for a site a usual/unusual traffic of mutex locks, busy mutex calls, mutex waits, etc.

The busy mutex locks on nsvs will not create deadlocks. However, it is not recommend to let the developer support (with db query tracking) running on busy site for longer time, since it will slow down the system. On the longer range, we could consider to add optionally a per-session mode to the developer support, such that it tracks only the queries of one session, and not the queries of the overall system. That would improve the lock congestion on the nsvs considerably, while covering some of the important use-cases.

all the best
-g

[1] https://bitbucket.org/naviserver/nsstats
[2] https://github.com/gustafn/munin-plugins-ns

Collapse
Posted by Dave Bauer on
Thanks.

I guess I explained it incorrectly. These aren't complex queries but its a script that processes thousands of records with short queries, so its just the total number of queries running. Thanks for the explanation. To debug or learn about the query statistics of such a script, a test mode which processes just a few rows makes more sense.

I will check out nsstats, that is helpful.

Collapse
Posted by Gustaf Neumann on
Hi Dave,

in terms of the locks, it makes no difference, whether a complex query or some maintenance script is run in a connection thread. The point is, when only one connection thread is active, then there are no locks from the developer support. If multiple connection threads are writing to the variables of the developer support, then locks must occur. Therefore, the locks you experienced depend on the current load caused by other users. If we would confine the developer-support to one connection thread, this would hurt busy sites much less than the current approach, where ds affects all threads.

I've now added the mutex statistics as well to the munin statistics of openacs.org [1]. You can find there
- waiting time for locks (the biggest dent is a waiting time of 1.4ms during log-rolling)
- busy locks (currently 0.5 busy locks per second on average), and
- number of locks (currently 500 locks per second, peaks around 6.000 locks per second)
OpenAcs.org is quite an idle site, one other sites we see easily 10x these values (experiencing good performance). The values of munin are moving averages in 5 minute intervals. One can configure additional line for more mutexes in the munin scripts, when e.g. the lock statistics of nsstats indicate this. The munin graphics are useful to increase the understanding of the factors influencing the perceived performance, and to detect unusal behavior requiring attention.

-g
[1] https://openacs.org/munin/localdomain/localhost.localdomain/index.html#naviserver