Forum OpenACS Q&A: Lazy site node caching

Posted by Malte Sussdorff on
I managed to get lazy site node caching to work, so you do not have to load all site nodes during startup and do not have problems when doing clustering.

Sadly I am not happy with the solution, as it still requires a trip to the database. Why?

The method I use is to start at the root node and, for each subsequent part of the URL, load the matching site node from the database into the cache if it is not already there. I stop the moment I cannot find the next part in the database and return the previous node. As you can see, this always results in a trip to the database, unless we are querying the root of a package. Why is that? The site node cache only covers packages and folders; it does not take file names into account. So if you have /file-storage/folder-view, /file-storage would be in the site_nodes table, but folder-view would not. There is no way for me to determine in advance whether it is there unless I hit the database.
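Editor's sketch of the walk described above, in plain Tcl. This is not the code from acs-tcl: the `db_nodes` array stands in for the site_nodes table, the `cache` array for the site_nodes NSV, and `db_lookup` and `lazy_get_from_url` are hypothetical names chosen for illustration.

```tcl
# Hypothetical stand-ins: db_nodes plays the site_nodes table,
# cache plays the site_nodes NSV. Values are made-up node ids.
array set db_nodes {
    /               1
    /file-storage/  2
}
array set cache {}
set db_hits 0

# Simulated database lookup for one node URL; counts every trip.
proc db_lookup {url} {
    global db_nodes db_hits
    incr db_hits
    if {[info exists db_nodes($url)]} {
        return $db_nodes($url)
    }
    return ""
}

# Walk the URL from the root, loading each missing part into the cache,
# and return the URL of the deepest site node that exists.
proc lazy_get_from_url {url} {
    global cache
    set current "/"
    if {![info exists cache($current)]} {
        set cache($current) [db_lookup $current]
    }
    foreach part [split [string trim $url "/"] "/"] {
        set next "${current}${part}/"
        if {![info exists cache($next)]} {
            set node_id [db_lookup $next]
            if {$node_id eq ""} {
                # Not a site node (e.g. a file name): stop at the previous node.
                return $current
            }
            set cache($next) $node_id
        }
        set current $next
    }
    return $current
}
```

Calling `lazy_get_from_url /file-storage/folder-view` twice returns `/file-storage/` both times, but the second call still makes one simulated database trip for `folder-view`, which is the cost the post complains about.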

The next thought I have is to add a "no_children_p" entry to the site_nodes NSV. This will tell me that the site_node I am looking at has no children, so I do not have to do that last lookup in the database. As the site node is flushed whenever I add a child, I should be safe from having to make a lot of code changes just to keep track of the variable.
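Editor's sketch of how such a flag could short-circuit the last lookup. It is a toy model, not the OpenACS implementation: each entry is assumed to be a two-element list `{node_id no_children_p}`, and `db_nodes`, `cache`, `db_lookup` and `lazy_get_from_url` are hypothetical names.

```tcl
# Hypothetical data: each entry is {node_id no_children_p}.
array set db_nodes {
    /               {1 0}
    /file-storage/  {2 1}
}
array set cache {}
set db_hits 0

# Simulated database lookup for one node URL; counts every trip.
proc db_lookup {url} {
    global db_nodes db_hits
    incr db_hits
    if {[info exists db_nodes($url)]} {
        return $db_nodes($url)
    }
    return ""
}

# Same walk as before, but stop without a database trip as soon as the
# current node is flagged as having no children.
proc lazy_get_from_url {url} {
    global cache
    set current "/"
    if {![info exists cache($current)]} {
        set cache($current) [db_lookup $current]
    }
    foreach part [split [string trim $url "/"] "/"] {
        if {[lindex $cache($current) 1]} {
            # no_children_p is set: the rest of the URL belongs to the package.
            return $current
        }
        set next "${current}${part}/"
        if {![info exists cache($next)]} {
            set row [db_lookup $next]
            if {$row eq ""} {
                return $current
            }
            set cache($next) $row
        }
        set current $next
    }
    return $current
}
```

In this model the first request for `/file-storage/folder-view` loads `/` and `/file-storage/`, and later requests for anything below `/file-storage/` are answered from the cache alone.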

Does this approach make sense to you? Did I miss something? And does this affect the XOTcl request processor? Do I need to make changes to xotcl-core as well, given that I am changing site_node::get_from_url, which is the only place that queries the site_nodes NSV at the moment?

2: Re: Lazy site node caching (response to 1)
Posted by Malte Sussdorff on
The site node caching now works with a limited number of trips to the database. To achieve this, though, I had to use a little trick. I am assuming that you are NOT mounting anything below "admin", "resources", "shared", "x" and "pvt". The reason is that these are directories in acs-subsite, and if I did not exclude them, every request to e.g. /resources/list.css would trigger a database query, because the "/" node has children and this is the criterion for starting a lookup in the database.

Another thing we should not do is call "site_node::get_children -all /", as I would have to update the cache *including children* for the "/" node to be sure to find the children. The alternative would be to rewrite the whole procedure to query the database directly all the time. Comments?

Posted by Jose Agustin Lopez Bueno on
Hi, Malte!

We are very interested in your work.
One of the problems with our cluster
(well, THE PROBLEM, I would say) is the site node
cache.

If you want us to test some code, please tell us.

Thanks,
Agustin

4: Re: Lazy site node caching (response to 1)
Posted by Malte Sussdorff on
The lazy site node caching is now committed to HEAD (acs-tcl). It is tested on a SINGLE postgres driven website, no cluster, no oracle. => I would not be surprised to find problems.

I will make a slow rollout to other testing sites of mine, and start working on a cluster soon. But if Agustin could test this as well I would be delighted.

It should get rid of your need to synchronize the site node cache between the cluster nodes EXCEPT when renaming or deleting a site_node. But for that you could probably write a quick script that calls all other servers and removes the specific site_node from the array.

Posted by Jose Agustin Lopez Bueno on
Hi, Malte!

We are testing it and have hit the first problem. The function
dotlrn::is_package_mounted does not detect the attachments package mounted below the dotlrn package during OpenACS startup.
Here is the error log:

[13/Mar/2007:12:56:41][14966.16384][-main-] Notice: Loading packages/dotlrn/tcl/dotlrn-init.tcl...
[13/Mar/2007:12:56:41][14966.16384][-main-] Notice: dotlrn-init: starting...
[13/Mar/2007:12:56:41][14966.16384][-main-] Notice: dotlrn-init: attachments being automounted at /dotlrn/attach
[13/Mar/2007:12:56:41][14966.16384][-main-] Notice: dotlrn::mount_package: object_type apm_package url /dotlrn/ object_id 2121 instance_name dotLRN
package_type apm_application package_id 2121 name dotlrn node_id 2120 has_children_p 1 directory_p t package_key dotlrn pattern_p t parent_id 498
NOTICE: adding missing FROM-clause entry for table "acs_object_id_seq"
CONTEXT: PL/pgSQL function "acs_object__new" line 17 at SQL statement
PL/pgSQL function "site_node__new" line 23 at assignment
[13/Mar/2007:12:56:41][14966.16384][-main-] Error: Ns_PgExec: result status: 7 message: ERROR: duplicate key violates unique constraint "site_nodes_un"
CONTEXT: SQL statement "INSERT INTO site_nodes (node_id, parent_id, name, object_id, directory_p, pattern_p) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "site_node__new" line 35 at SQL statement

[13/Mar/2007:12:56:41][14966.16384][-main-] Error: dbinit: error(pizarradb.uv.es:5433:openacsdb_5_2_desa,ERROR: duplicate key violates unique constraint
"site_nodes_un"
CONTEXT: SQL statement "INSERT INTO site_nodes (node_id, parent_id, name, object_id, directory_p, pattern_p) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "site_node__new" line 35 at SQL statement
): '

select site_node__new(NULL,'2120','attach',NULL,'t','t',NULL,NULL)
...

Regards,
Agustín

Posted by Jose Agustin Lopez Bueno on
Another point.

If we create a community, things are faster,
but we cannot access that community until the server
is restarted.

(We are doing the tests in a cluster with only
one member.)

Agustín

Posted by Jose Agustin Lopez Bueno on
This patch for acs-tcl/tcl/site-nodes-procs.tcl
(line 551), in the function site_node::get_from_url,
resolves the new-community problem (the new community
is not shown without a server restart):

if {[catch {nsv_get site_nodes "${new_url}/"} result] == 0} {
    set node_id ""
} else {
    if {$new_node(has_children_p) && [lsearch $acs_subsite_dir_list $name] == -1} {
        set node_id [db_string node_id "select node_id from site_nodes where parent_id = :parent_id and name=:name" -default ""]
        ns_log Debug "Loading from the database $test_url $name $parent_id"
    } else {
        set node_id ""
    }
}

Posted by Malte Sussdorff on
Simple answer: .LRN sucks. More correct answer:

.LRN makes use of the NSV array directly. I fixed the issues during startup, but the fact remains that .LRN has rewritten site nodes in dotlrn/tcl/site-nodes-procs. These procedures should not be in the .LRN package in the first place. But they are there, and I don't have the time (nor does the client have the budget, as he isn't using .LRN) to fix this at the moment.

Here is the fix for .LRN initialization:

===================================================================
RCS file: /cvsroot/openacs-4/packages/dotlrn/tcl/applets-procs.tcl,v
retrieving revision 1.20
diff -r1.20 applets-procs.tcl
38c38
< if {[nsv_exists site_nodes "[get_url]/"]} {
---
> if {[site_node::get_node_id -url "[get_url]/"] ne ""} {
Index: tcl/dotlrn-procs.tcl
===================================================================
RCS file: /cvsroot/openacs-4/packages/dotlrn/tcl/dotlrn-procs.tcl,v
retrieving revision 1.75
diff -r1.75 dotlrn-procs.tcl
108d107
< FIXME: refactor
110,122c109,115
< set dotlrn_ancestor_p 0
< set package_list [nsv_array get site_nodes "[get_url]/${package_key}*"]
<
< for {set i 1} {$i < [llength $package_list]} {incr i 2} {
< array set package_info [lindex $package_list $i]
<
< if {[site_node_closest_ancestor_package -default 0 -url $package_info(url) [package_key]] != 0} {
< set dotlrn_ancestor_p 1
< break
< }
< }
<
< return $dotlrn_ancestor_p
---
> set site_node [site_node::get_node_id -url "[get_url]/${package_key}/"]
> if {$site_node eq ""} {
> return 0
> } else {
> return 1
> }
Posted by Jose Agustin Lopez Bueno on
Ok.

Your code works. At the moment everything works except:

-Some portlets display a message like this (example):
Error in include template "/var/lib/aolserver/oacs_5_2/packages/lorsm/lib/user-lorsm": can't read "url_by_node_id(17673735)": no such element in array

-xotcl:
site_node::get "must pass in either url or node_id"
while executing
"error "site_node::get \"must pass in either url or node_id\"""
(procedure "site_node::get" line 4)
invoked from within
"site_node::get -url $mount_url"
(procedure "::Generic::package_id_from_package_key" line 4)
invoked from within
"::Generic::package_id_from_package_key xotcl-request-monitor"

I would like to resolve these problems before testing performance in our real cluster.

Any pointers?
Agustín

Posted by Malte Sussdorff on
package_id_from_package_key makes a round trip to the site node cache to figure out the package_id. Not sure why, but that's how it is. You can easily circumvent this by changing the way xotcl requests the package_id, or by running site_node::update_cache with the node_id of the object = package_id from the request processor on initialization of the request processor.

I don't have a checkout of LORSM from 5.2, but in HEAD I could not find anything which could cause this behaviour. If you could post more of the error (not only this message but the whole text from the error log), then I might be able to help more.

In general there should be no direct nsv call to the site node. In HEAD it is done using site_node::get_url_from_object_id and this should work just fine (at least the procedure should be unable to fail), but maybe you have a culprit there.

Hi Malte!

We are running our production cluster with your code,
with some small modifications. The speed has increased considerably.
Thanks!

We have since 400 concurrent connections.
If anybody wants to see how our system is set up:

http://aulavirtual.uv.es/ficheros/view/imagenes%5C/CLUSTER_AulaVirtual_pub.gif

NOTES:

Another patch (line 360, site-nodes-procs.tcl):
db_foreach $query_name {} {
    if {$parent_id eq ""} {
        # url of root node
        set url "/"
    } else {
        # append directory to url of parent node
        if { [info exists url_by_node_id($parent_id)] } {
            set url $url_by_node_id($parent_id)
        } else {
            set url [db_string snid {select site_node__url(:parent_id)} -default ""]
        }
        append url $name
        if { $directory_p eq "t" } { append url "/" }
    }

Regards,
Agustin

Posted by Jose Agustin Lopez Bueno on
Hi!

Another thing about clustering:

We have a modified whos-online (we call it cwhos-online)
that shows all the connected users in the cluster.
It is like whos-online but with three parameters.
We do not show the cluster's online users in the top bar because it
would hurt performance.

If anybody is interested we can publish the code.
You can see it:
http://aulavirtual.uv.es/shared/cwhos-online
or
http://aulavirtual.uv.es/shared/cwhos-online?show_ips=t&unique=t&moreinfo=t

Regards,
Agustín

This is excellent news, Agustin. If you have the time, maybe you can patch the file in CVS HEAD directly and upload it? This would be great.

.LRN Honchos. Is it okay to include the patch above in 5.3 or should I only commit this to the HEAD version of .LRN?

Posted by Gustaf Neumann on
Changed the lookup of package_id from package key in xotcl-core (CVS HEAD) so that it no longer needs the site nodes.
Posted by Malte Sussdorff on
I realized the hard way that the code has a couple of issues left, mainly with the number of times it hits the database and the number of times the site_node cache is updated for a specific node_id.

The main problem here is that from the URL you cannot detect if you are looking at a site_node or a folder within a package.

So, if you go from top to bottom, I already implemented a check for "has_children_p". Sadly this does not help if you work with multiple subsites e.g /malte. /malte/register would look at /malte, see it has children (e.g. /malte/photo), therefore try to load /malte/register from the database, which will obviously fail because /register is a directory, not a site_node.

Next idea is to try and load the children of a site_node not mounted under / and /dotlrn/, /dotlrn/clubs, /dotlrn/communities (you get the picture). Then I could say "children_loaded_p" and if I come from the top and hit a "children_loaded_p" in the site_node cache then I assume that all children have been loaded and go the usual way of finding a site_node, which assumes that it is in the site_node cache.

Only if this fails would I try to reload the site_node cache for the last node (from bottom to top) that says "children_loaded_p". The reason is that you might want to mount a new package under an already loaded site_node, which would then not be found by the other servers in the cluster.
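Editor's sketch of the "children_loaded_p" idea in plain Tcl. It is a simplified model under stated assumptions: `db_children`, `cache`, `children_loaded`, `load_children` and `resolve` are hypothetical names, the exclusion of / and /dotlrn/ from bulk loading is not modelled, and the decision about when a failed lookup should trigger a reload is left to the caller.

```tcl
# Hypothetical stand-ins for the site_nodes table (parent URL -> child names)
# and for the per-server cache.
array set db_children {
    /        {malte}
    /malte/  {photo}
}
array set cache {
    /  1
}
array set children_loaded {}
set db_hits 0

# Load all children of one node in a single simulated database trip
# and mark the node as children_loaded_p.
proc load_children {parent} {
    global db_children cache children_loaded db_hits
    incr db_hits
    if {[info exists db_children($parent)]} {
        foreach name $db_children($parent) {
            set cache(${parent}${name}/) 1
        }
    }
    set children_loaded($parent) 1
}

# Resolve from the cache; only nodes whose children were never loaded
# cause a database trip.
proc resolve {url} {
    global cache children_loaded
    set current "/"
    foreach part [split [string trim $url "/"] "/"] {
        if {![info exists children_loaded($current)]} {
            load_children $current
        }
        set next "${current}${part}/"
        if {![info exists cache($next)]} {
            # Assumed to be a directory or file inside the package.
            return $current
        }
        set current $next
    }
    return $current
}
```

In this model `/malte/register` resolves to `/malte/` after two database trips and needs none afterwards. If another cluster member then mounts a package at `/malte/forums/`, `resolve` keeps answering `/malte/` until `load_children /malte/` is called again, which is the reload step the post describes.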

Any other idea is highly welcome, though. I am out of ideas for how to make a good assumption about whether or not I need to hit the database for a URL like /malte/forums/moderate/edit.

Posted by Malte Sussdorff on
I created a new version of lazy site node caching which roughly follows what I have been describing. The remaining issues are:

a) Unnecessary hits to the database. Due to the fact that files can reside under /packages/acs-subsite/www and be mounted under that package, we always hit the database unless we announce for / that we have loaded all children, which is exactly not the purpose of lazy caching. So the occasional hit to the DB is in there. The same is true for directories under /www.

To solve this issue I have come up with a list of reserved names under / which will not trigger a trip to the database. At the moment these are the usual suspects "resources images image x o file admin". As this is a list of names you are not allowed to use when mounting a package, maybe we can get this list complete and also make sure that no one gets the idea of mounting a package with one of those names in the first place?

b) Much more testing needed. I have it now on a busy production site, but this is using acs-subsite a lot and does not make use of dotlrn. So maybe someone with a busy .LRN site could test it again? Also I'm not sure if all the issues reported by Gustaf have been fixed.
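Editor's sketch of the reserved-name check from (a), in plain Tcl. The names come from the post; holding them in a Tcl list called `reserved_names` and the helper name `needs_db_lookup_p` are assumptions made for illustration, not the code committed to HEAD.

```tcl
# Reserved names as listed in the post; keeping them in a plain Tcl list
# is an assumption of this sketch.
set reserved_names {resources images image x o file admin}

# Return 0 if the first path element below "/" is reserved (or the URL is
# the root itself) and therefore cannot be a site node; return 1 if a
# database lookup may be needed.
proc needs_db_lookup_p {url} {
    global reserved_names
    set first [lindex [split [string trim $url "/"] "/"] 0]
    if {$first eq ""} {
        return 0
    }
    if {[lsearch -exact $reserved_names $first] != -1} {
        return 0
    }
    return 1
}
```

With this check a request such as `/resources/list.css` skips the database, while `/forums/` can still trigger a lookup. The same list could be consulted when a package is mounted to reject these names.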

Posted by Jose Agustin Lopez Bueno on
Hi, Malte.

We are testing your code a bit.
So far we have detected only a few problems:

-The images in
http://www.xx.xx/dotlrn/configure
for up, down, delete, ... are not shown
until we refresh the cache with
http://www.xx.xx/acs-admin/cache

Same behaviour with urls
portal/admin/portal-config?portal_id=xxx
below
http://www.xx.xx/dotlrn/admin/portal-templates

-We are getting the following messages in the log:
[30/May/2007:09:38:51][23199.1106975072][-conn:20-] Notice: Loading from the database /dotlrn/classes/c032/12254/c07c032a12254gH c07c032a12254gH 4890528
[30/May/2007:09:38:54][23199.1106975072][-conn:20-] Error: Tcl exception:
can't read "node(url)": no such element in array
while executing
"string equal $node(url) "[ad_conn url]/""
(procedure "rp_filter" line 92)
invoked from within
"rp_filter preauth"
...

Regards,
Agustin

Posted by Malte Sussdorff on
The first one is really strange. It should find the images no problem. What are the links to these images? And did you use the latest version from HEAD (as I initialize all non subsite packages)? Your error message does not indicate that 😊.

Furthermore, run the server in Debug mode. This will help much more in getting to the root of the problem. As for the portal stuff, please find out where the code with "can't read node(url)" is actually coming from: which procedure was called, and where, that tried to find the node.

Posted by Emmanuelle Raffenne on
Hi Agustin and Malte,

I suspect this problem comes from the site_nodes library that both dotlrn and new-portal implement (yes! it's duplicated in those 2 packages...). This should be removed and the calls to those procs replaced by calls to site-node ones (acs-tcl).

I've planned to work on this some time during the next 2 weeks. Not sure I will have time though.

Posted by Malte Sussdorff on
Emma, thanks for this information. I guess it really is time for a cleanup 😊. If you have time that would be wonderful, especially as I don't have a larger .LRN installation on which to test this.
Posted by Emmanuelle Raffenne on
Hi,

I've finished the cleanup of site_nodes and committed on HEAD. The duplicated libraries have been removed and calls to site_nodes::* replaced by calls to site_node::

I just did some smoke tests: dotlrn install, create communities, add applets, create users and add them to communities, etc. (where those calls were used), and everything seems to work correctly.

Hi, Emma!

Thanks for the work!

I have put the code in our development environment, but I ran into
the following problem:

When I add an applet to a community it appears OK,
but when you click the URL of the new applet (in admin)
you get the error:

The requested URL was not found on this server.

When you restart OpenACS, the URL works.

Regards,
Agustin

Posted by Emmanuelle Raffenne on
Agustin,

Did you reinstall first? It won't fix what you describe but the dotlrn installation was broken because of site_nodes.

Anyway, I've noticed this behavior too. I was about to post about it after doing more tests.

On my side, I get a "file not found" when I go to the admin pages, but I don't need to restart aolserver, just to reload the page (I'm using firefox 2 on debian). It happens also on openacs when applications are added, so I guess there are still issues with caching.

Malte, any ideas?

Posted by Emmanuelle Raffenne on
Agustin, Malte,

I've tested again. Here are the steps I followed:

- Standard install of dotlrn from HEAD
- created a new community
- went to calendar index page of the new created community: page displayed
- went to the admin index page of calendar in the new created community: File not found
- reload the page: page displayed
- went to the admin index page of forums in the new created community: File not found
- reload the page: page displayed

There are no error messages in logs.

Note: The same occurs in OpenACS when I mount an application: user pages are displayed correctly, but I get a "file not found" the first time I go to the admin page of the application.

After that, I've replaced acs-tcl/tcl/site-nodes-procs* with the last revision of oacs-5-3 branch and followed the same steps:

- created a new community
- went to calendar index page of the new created community: page displayed
- went to the admin index page of calendar in the new created community: page displayed
- went to the admin index page of forums in the new created community: page displayed

Posted by Nima Mazloumi on
Just installed the latest from HEAD. There is still a minor bug. When I add an applet to a community and follow the link to the admin page of the applet, AOLserver complains that the resource was not found. If I refresh, the site node is available.
Posted by Torben Brosten on
Hi Malte,

Just a heads up.

I tried using the notes package and got the "file not found" for all the user pages except the one at the mount point (which apparently redirects to /add).

notes package mounted at /notes-test1 from a fresh db using cvs head (June 19) and no parameter changes from defaults.

Swapping out the site-nodes-procs* from oacs-5-3 (and reloading via apm) fixed the problem.

cheers,

Torben

Posted by Torben Brosten on
..well.. fixed the notes problem anyway.

After re-installing the db using the oacs-5-3 version of site-nodes-procs* (to rule out any residual side effects), I see this:

/api-doc/ file not found
/api-doc/index file not found
/api-doc/index? the usual page shows up.

but that probably doesn't help much for test cases with the current code.

Am trying again with current cvs, since there appear to be some recent changes in those files..

Posted by Torben Brosten on
current version works.. Wonderful! and thank you =)
Posted by Emmanuelle Raffenne on
Torben,

By current version, do you mean 5.3?

I have the "file not found" problem but with HEAD code, not with 5.3 one.

Posted by Torben Brosten on
Sorry for the confusion, Emmanuelle.

Current (June 30th) version of head is working fine for me. I'm using aolserver 4.0.10, pg8.1 on OS X 10.4.10.

Posted by Malte Sussdorff on
Though I need to work on a lot of things for the next release of OpenACS (acs-mail-lite being the primary culprit) and ]po[ (upgrade to latest OpenACS version), please post *any* issues you have with the HEAD version of site-node caching here (I am not subscribed to bug-tracker, sorry...). Just make sure you run off the latest version from HEAD.
Posted by Malte Sussdorff on
The /admin bug is fixed now in HEAD at least in the instances where I tested it 😊.
Posted by Emmanuelle Raffenne on
Thanks Malte. Seems to work fine now.
Posted by Gustaf Neumann on
Malte,

can you look at the test for site_node::update_cache?
test/admin/testcase?testcase_id=site_node_update_cache&package_key=acs-tcl&view_by=testcase&category=&quiet=0
Two tests are failing in the CVS HEAD version. It seems some updates are still needed.

Many thanks
-gustaf