Forum OpenACS Q&A: Performance Problems - what is the bottleneck?

Hi all,

I have problems with the performance of our installation here I seek your advice because I don't know whether this is because of our architecture, or miss configuration of aolserver or database or small hardware.

What basically happens is that once I have more than 150 users logged in the response time for a page takes a minute or more which is very frustrating for users. Today we received an email saying that .LRN is simply "shit". I don't know what to do next. All I know is that from semester to semester the load is getting higher but the frustration as well.

Here our system details.

We have three linux boxes. One for an aolserver with database connection, one for a static aolserver and one for the database.

The database box never goes above 5-10%. The static server is also not very busy but the dynamic server can go upt to 99% and a load of 10 and more.

Any help or comment is appreciated.

Greetings,
Nima

---------------------------------
dotlrn (dynamic server)
---------------------------------

AOLServer 4.0.10 (connected to the database)
Pound 1.8.2 (as reverse proxy for ssl and load balancing)
Apache 2.0.53 (only redirect from 80 to 443 where pound is)

SuSE 9.2
Linux Linux version 2.6.8-24.18-smp (gcc version 3.3.4 (pre 3.3.5 20040809)) #1 SMP Fri Aug 19 11:56:28 UTC 2005
4 CPU Intel(R) Xeon(TM) CPU 3.06GHz , L2 cache: 512K
4 GByte RAM - Memory: 4070968k/4111296k available (2339k kernel code, 39528k reserved, 824k data, 252k init, 3193792k highmem)
2 Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet Cards

AOLServer config.tcl
...

ns_section ns/server/${server}/modules
ns_param libthread ${homedir}/lib/thread2.6.2/libthread2.6.2.so
ns_param nssock ${bindir}/nssock.so
ns_param nslog ${bindir}/nslog.so
ns_param nssha1 ${bindir}/nssha1.so
ns_param nscache ${bindir}/nscache.so
ns_param php ${bindir}/libphp4.so
ns_param nsrewrite ${bindir}/nsrewrite.so
ns_param nscgi ${bindir}/nscgi.so

ns_section ns/parameters
ns_param serverlog ${serverroot}/log/error.log
ns_param home $homedir
ns_param maxkeepalive 100
ns_param logroll on
ns_param maxbackup 5
ns_param debug $debug
ns_param mailhost localhost
ns_param proxy www-cache.uni-mannheim.de
ns_param proxy_port 3128
ns_param ReverseProxyMode true

ns_section ns/threads
ns_param mutexmeter true
ns_param stacksize [expr 256*1024]

ns_section ns/server/${server}
ns_param directoryfile $directoryfile
ns_param pageroot $pageroot
ns_param maxconnections 100
ns_param keepalivetimeout 15
ns_param maxdropped 0
ns_param maxthreads 65
ns_param minthreads 65
ns_param threadtimeout 3600
ns_param globalstats true
ns_param urlstats true
ns_param maxurlstats 1000

ns_section ns/server/${server}/module/nssock
ns_param timeout 120
ns_param address $address
ns_param hostname $hostname
ns_param port $httpport
ns_param maxinput 20971520

ns_section ns/db/pool/pool1
ns_param maxidle 1000000000
ns_param maxopen 1000000000
ns_param connections 100
...

ns_section ns/db/pool/pool2
ns_param maxidle 1000000000
ns_param maxopen 1000000000
ns_param connections 60
...

ns_section ns/db/pool/pool3
ns_param maxidle 1000000000
ns_param maxopen 1000000000
ns_param connections 30
...

ns_section ns/server/${server}/module/nslog
ns_param debug false
ns_param dev false
ns_param enablehostnamelookup true
ns_param file ${serverroot}/log/${server}.log
ns_param logcombined true
#ns_param extendedheaders COOKIE
#ns_param logrefer false
#ns_param loguseragent false
ns_param maxbackup 1000
ns_param rollday *
ns_param rollfmt %Y-%m-%d-%H:%M
ns_param rollhour 0
ns_param rollonsignal true
ns_param rolllog true

---------------------------------
dotlrn5 (static server)
---------------------------------
AOLServer 4.0.10 (static)

SuSE 9.2
Linux version 2.6.8-24.18-smp (geeko@buildhost) (gcc version 3.3.4 (pre 3.3.5 20040809)) #1 SMP Fri Aug 19 11:56:28 UTC 2005
2 CPU Intel Pentium III (Katmai) 500 MHz
512 MB RAM - Memory: 512744k/524288k available (2339k kernel code, 11100k reserved, 824k data, 252k init, 0k highmem)
2 3Com Corporation 3c905B 100BaseTX [Cyclone] Cards

---------------------------------
dotlrn2 (database server)
---------------------------------
PostgreSQL 7.4.8

SuSE 9.2
Linux version 2.6.8-24.18-smp (geeko@buildhost) (gcc version 3.3.4 (pre 3.3.5 20040809)) #1 SMP Fri Aug 19 11:56:28 UTC 2005
4 CPU Intel(R) Xeon(TM) CPU 3.06GHz
4 GB RAM - Memory: 4071012k/4111296k available (2339k kernel code, 39484k reserved, 824k data, 252k init, 3193792k highmem)
2 Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet Cards

Postgresql configuration:

tcpip_socket = true
max_connections 200
shared_buffers = 15384
sort_mem = 16384
vacuum_mem = 131072
max_fsm_pages = 100000
max_fsm_relations = 1000
effective_cache_size = 786432
random_page_cost = 2.5
cpu_tuple_cost = 0.01
log_timestamp = true
lc_messages = 'de_DE.UTF-8' # locale for system error message strings
lc_monetary = 'de_DE.UTF-8' # locale for monetary formatting
lc_numeric = 'de_DE.UTF-8' # locale for number formatting
lc_time = 'de_DE.UTF-8' # locale for time formatting

Collapse
Posted by Dave Bauer on
If its with a large number of concurrent requests, it seems like you possibly don't have enough interpreters to immediately serve those requests.

Do you have free RAM on the box with AOLserver running? Try increasing maxthreads a little at a time to see if it improves.

I don't know if there is a way to see if there are requests waiting for a free thread. Hopefully someone with more AOLserver knowledge will be able to help.

Collapse
Posted by Dave Bauer on
Another thing I thought of:

Are there long-running queries that are blocking requests?

I don't know if a number of long running queries would show up as high load on the database server or not, if the database still had additional resource to serve more requests, but just had slow queries running at that point.

What excactly are you serving in you static site?
all the images? or are you still service images in you dinamic site? (verify your request logs)

In your .LRN dinamic server:
how many ram is consuming the nsd process?

In PG, increase:
shared_buffers = 65636
sort_mem = 32168

You're running vacuum analyze at least once a day, right?
pg_autovacuum is as well very good option.

Are your CR files stored in the file system?
if not, try to move all of them, and set your applications to use the fs and not the db.

Probably one effective way to increase performance, is to reduce the amount of requests to your dinamic server.
One way to achieve that is to put ALL your static files (images, css, etc) into the resources folder of each package, and then put them into your static site, and add a rule into pound to redirect /resources/* to the static site, so it goes like this:

/pkg1/www/images/img.gif
you move it to
/pkg1/www/resources/img.gif
and then you change your .adp /.tcl that refers to that img.gif
(href=/resources/pkg1/img.gif)
and put that files using the same paths in your static site
/www/resources/pkg1/img.gif
then add the rule to pound
/resources.*
and all of those files will be served from static site

After, you'll need a way to sync your dinamic and static site, but that's easy.

Do you have public areas of your .LRN site that are viewed by many users, and do not change so often?
(that sort of areas are good candidates to be replicated as static, and have procs to put them up-to-date after something has changed)

Next step will be to have a cluster.

How many concurrent connections do you have at peak times?
Try this in your dinamic system:
while [ 1 ]; do echo -n `date` >> tcp_report.txt; echo -n ' | ' >> tcp_report.txt; lsof | grep 8000 | grep TCP | grep -v LISTEN | wc -l >> tcp_report.txt; done;
(thanks to Derick)

Also look for blocked processes in PG (something that is waiting or blocked for too long sometimes reduces the performance, you can kill it). ps axuf

Collapse
Posted by Nima Mazloumi on
how many ram is consuming the nsd process?
41% = 1.8 GB

shared_buffers = 65636
sort_mem = 32168
What is the shmax according to these figures?

You're running vacuum analyze at least once a day, right?
once every 24h

pg_autovacuum is as well very good option.
Where do I configure that?

Are your CR files stored in the file system?
yes

One way to achieve that is to put ALL your static files (images, css, etc) into the resources folder of each >package, and then put them into your static site
already done

add a rule into pound to redirect
already done

Next step will be to have a cluster.
like several dynamic server connected to one database?

How many concurrent connections do you have at peak times?
Try this in your dinamic system:
while [ 1 ]; do echo -n `date` >> tcp_report.txt; echo -n ' | ' >> tcp_report.txt; lsof | grep 8000 | grep TCP | grep -v LISTEN | wc -l >> tcp_report.txt; done;
(thanks to Derick)

Also look for blocked processes in PG (something that is >waiting or blocked for too long sometimes reduces the >performance, you can kill it). ps axuf
how can I kill a blocked postgresql process securely?
Collapse
Posted by Andrew Piskorski on
See also the Performance problems with large installation thread Nima started on the AOLserver list.
Collapse
Posted by Steve Manning on
In PG, increase:
shared_buffers = 65636
sort_mem = 32168

I would be careful of setting shared_buffers so high. Several Pg tuning articles suggest that this can cause performance degradation. Its not a straight forward calculation however so testing different values might be worthwhile. Setting shared_buffers to about 6% of memory (about 31457 on a 4Gb machine) should be starting point especially when used in conjunction with effective_cache_size set to a decent value.

A decent effective_cache_size allows the query planner to uses indexes instead of sequential scans. On a mchine dedicated to Pg this value can be set very high between 2/3 and 3/4 of RAM seems to be the suggestion - Nima has this set to 786432 which equates to 6Gb - might be a tad high on a 4Gb machine. Again experimentation to find a good value seems to be the order of the day.

- Steve

Collapse
Posted by Denis Roy on
If you only have 65 threads available, it also means that only 65 user requests can get served by AOLserver. In theory, if you have page response times of less than 1 sec, 65 threads can potentially serve many more users (at least up to 150-200 users). The problem that kicks in, though, is that you have response times of >1 min.

So in an extreme case, you will have 65 users waiting for the page to load, another 35 users in the wait queue (maxconnections = 100) and the remaining 50+ users getting an "Unable to connect to _your server_".

You will have to find out what the primary reason for the huge response times are. Does the application wait to get a free db handle, does it wait for queries to complete, does the RP take a long time, or is it maybe just the browser waiting for a connection to the AOLserver? If you have developer-support installed, it will probably give you some valuable data.

Next step depending on where the problems is might include:

  • increase maxthreads (I believe you can easily increase it to 80+ on such a big server)
  • optimize pages/queries
  • put as much data as possible into cache

We have seen similar problems for AIESEC, and I can tell you that the solution was mostly in application tuning, assuming you have the right database setup. The biggest performance impacts we have seen are related to the portal system and the permissioning system which take up a lot of database time. We are currently upgrading to OpenACS5/.LRN2 but I am still sure that those will be the most expensive pages/queries on the system.

Collapse
Posted by Don Baccus on
Comments regarding database configuration (such as Steve Manning's) and the cost of permissioning and the portal system regarding the database (Denis Roy's) seem to miss the crucial point:

The database server's sitting mostly idle. It's the dynamic AOLserver box that's hosed with a load going over 10.

I have no solution to offer but Nima, it seems obvious you should be concentrating on the box that's the busiest, your Tcl box, not the database server or static server.

Profiling with developer support may help you gather the necessary information, along with AOLserver stats gathering. Without more information I can not offer any advice.

Collapse
Posted by Denis Roy on
My main point was that Nima has to find out what the primary reasons for his extremely slow page response times are, possibly using developer support (and AOLserver stats as Don mentioned).

The comments about the database were mainly to illustrate that application tuning was the most crucial thing for us. Even though we have a strong quad-Opteron database server which is usually about 50-70% idle and the load rarely goes over 1.5 or 2, we do have some pages that take a very long time to execute because there are e.g. many huge tables being joined and/or very many permission checks on those pages. Btw, when I mentioned the portal system, I didn't mean the portal system as such but rather some portlets which run very expensive queries.

So based on our experience, I would support what Dave has suspected earlier: Long-running queries can result in slow response times even though the load on the db server is low. And this in turn can block available AOLserver threads which will mainly become a problem during peak times with many concurrent users.

I didn't mean to say, though, that this or any other reason is causing all the trouble. That, as mentioned, needs more careful analysis of the entire system.

The hypothesis isn't supported by the fact that the dynamic server's busting its balls. If long-running queries were causing threads to stall leading to too few threads being available to service connection requests, one would expect the dynamic server to be waiting for the busy database server to complete those long-running queries. But there is no busy database server ...

If it were me, I'd concentrate on that Tcl server and not worry about anything else until I had a reasonable understanding as to why it is so heavily loaded.

Collapse
Posted by Denis Roy on
Nima, if you have a spare server with an AOLserver instance (or where you can install it), here is another thing you could try first before diving deeper into system statistics gathering:

A fast way to find out whether your problem is solely your dynamic server box would be to set up a second AOLserver on a separate physical server to access your database and fire it up.

You don't have to worry about a cluster yet as you don't have to put it into production. Just browse your site using the second instance during peak times. Since you will have 150+ users on your production site, but only 1 user on the second system, you will see how fast the response times on your second server compared to the production server.

If the response times are much faster as you would usually expect them on your production site without too many users, then concentrate on the dynamic server first. If you still have longer response times than expected, you will have to keep gathering more information and statistics first as mentioned.

The cluster functionality has been just updated for a fix to the bug #2396, thanks to Victor Guerra who did the patch.

So now, anyone should be able to run clusters of openacs without troubles of caching.

This bug fix can be applied separately to any openacs installation (oacs-5-1).

http://cia.navi.cx/stats/author/victorg/.message/11c18

Enjoy!

Collapse
Posted by Rocael Hernández Rizzardini on
We're seeing as well slow performance of some queries, even if our box is idle at the time of the request (dual xeon 3.0Ghz with 4GB RAM, DB only).
Mostly happening with permissions, but some other queries are very slow in a portlet fashion (lors, calendar, fs).

When checking the permission, I have seen that when the result will return 0, meaning denied permission, it takes about 1500msec to return the query. (acs_permissions rows 1679466)

One thing that improved our performance on that specific issue with permissions is that we extended the life of the permission for 1 day (kernel param-> PermissionCacheTimeout), although that is not a fix or a real enhancement in the DB side. Always check that the chache amount designated to memoize is big enough, if not you'll end having a lot of flushing (kernel param-> MaxSize), check your memoize hit rate at:
/acs-admin/cache/

Collapse
Posted by Don Baccus on
Roc, would you post an example query that's taking 1.5 seconds to fail a permissions check, along with the Postgres plan that's being generated?
Hello!

We have too many slow queries. Rewriting sql can increase 10x times or more the speed. The worse thing I think of the cost of the queries is postgres version dependent. Sniff.

(Our DB have a similar size than Roc DB.)

Regards,
Agustin

1310 ms 0or1row dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1 from dual
where exists
( select 1
from acs_object_party_privilege_map ppm
where ppm.object_id = '1228' and ppm.party_id = '24471989' and ppm.privilege = 'dotlrn_browse' )

Now, with explain analyze:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=3655.01..3655.04 rows=1 width=0) (actual time=3377.385..3377.402 rows=1 loops=1)
   One-Time Filter: $0
   InitPlan
     ->  Nested Loop  (cost=6.35..3655.01 rows=38 width=0) (actual time=3377.341..3377.341 rows=1 loops=1)
           ->  Hash Join  (cost=6.35..2462.49 rows=198 width=4) (actual time=0.616..1953.102 rows=82802 loops=1)
                 Hash Cond: (("outer".privilege)::text = ("inner".privilege)::text)
                 ->  Nested Loop  (cost=0.00..2446.33 rows=626 width=16) (actual time=0.071..1060.517 rows=82804 loops=1)
                       ->  Index Scan using acs_object_context_index_pk on acs_object_context_index c  (cost=0.00..39.90 rows=13 width=4) (actual time=0.035..0.040 rows=2 loops=1)
                             Index Cond: (object_id = 1228)
                       ->  Index Scan using acs_permissions_pk on acs_permissions p  (cost=0.00..184.49 rows=50 width=20) (actual time=0.017..214.914 rows=41402 loops=2)
                             Index Cond: ("outer".ancestor_id = p.object_id)
                 ->  Hash  (cost=6.34..6.34 rows=4 width=14) (actual time=0.325..0.325 rows=0 loops=1)
                       ->  Seq Scan on acs_privilege_descendant_map pdm  (cost=0.00..6.34 rows=4 width=14) (actual time=0.252..0.302 rows=2 loops=1)
                             Filter: ((descendant)::text = 'dotlrn_browse'::text)
           ->  Index Scan using party_approved_member_map_pk on party_approved_member_map pamm  (cost=0.00..6.01 rows=1 width=4) (actual time=0.010..0.010 rows=0 loops=82802)
                 Index Cond: ((pamm.party_id = "outer".grantee_id) AND (pamm.member_id = 24471989))
   ->  Subquery Scan dual  (cost=0.00..0.02 rows=1 width=0) (actual time=0.028..0.038 rows=1 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.020..0.023 rows=1 loops=1)
 Total runtime: 3377.578 ms
(19 rows)

In this case, is giving lots of rows. I don't remember right now, I think it was as well that we needed a way to stop the query execution when just one row is returned. I was diging into this like 2 months ago, we might go further into this now.

limit 1 doesn't do the magic.
we're running PostgreSQL 7.4.7 on i386-pc-linux-gnu
Collapse
Posted by Gustaf Neumann on
i would check the join conditions such they have indices.
does an index for party_approved_member_map.party_id help?
I tested many new index combinations without good results. =(
But I need to check exactly which were the index combinations tested.
The hash join is killing you, it returns 82804 rows while PostgreSQL expected 198 rows. It looks you need a "VACUUM ANALYZE", if it doesn't help you need to configure PostgreSQL to gather more statistics while analyzing (it is a setting somewhere, read PostgreSQL docs).
Yes, I believe Daniel's right, good eye there.

Roc, what does the query:

select count(*)
from acs_object_party_privilege_map ppm
where ppm.object_id = '1228' and ppm.party_id = '24471989' and ppm.privilege = 'dotlrn_browse';

return?

Also .LRN and several OpenACS packages declare far too many custom privileges. I've mentioned this before and in fact have cleaned up a couple of packages. The hierarchy map is always short enough to be sequentially scanned. Cutting the number of privileges in the system in half would cut the cost of joining against the hierarchy table by half as well, on average.

Something folks should think about while working on new packages especially (please, don't define new versions of "read" or "write" when plain old "read" or "write" will work!)

This isn't the cause of your extremely slow query, rather I'm pointing out (again) that in those cases where permissions queries are reasonably fast, they could be made a couple of times faster by reducing the number of unnecessary custom privileges.