Forum OpenACS Q&A: Call Doctor or Plumber? : Memory Leak

We've been running nsd8x from AOLserver 3.2+ad12 with the hollyjerry.org patch since late March. Jerry helped us set up multiple AOLserver instances, and only in the past 3 weeks did we move a moderate-volume site to this server (Red Hat 7.0). We also upgraded to Postgres 7.1.2.

Last week, AOLserver stopped serving pages and the log ended with this entry:

...[-conn910-] Notice: dbinit: sql(localhost::tgndata): ' select user_id, token, secure_token, last_ip, last_hit from sec_sessions where session_id = 882531  '
nsthread(13272) error: ns_realloc: could not allocate 1455184 bytes

I just restarted all the AOLserver instances (main and 2 virtual) and everything started working again. I didn't look any further into the WHY question.

Yesterday, while running top, I noticed that the busy site's server had consumed 70% of memory. I restarted the servers and it dropped back to 3%, but the busy site's AOLserver keeps consuming memory: about 24 hours later it is up to 31%. Here is the most recent top output:

  7:59pm  up 125 days, 19:22,  1 user,  load average: 0.24, 0.08, 0.02
114 processes: 112 sleeping, 1 running, 1 zombie, 0 stopped
CPU states:  0.3% user,  3.2% system,  0.0% nice, 96.3% idle
Mem:   516140K av,  513048K used,    3092K free,   62796K shrd,  140192K buff
Swap:  265032K av,    4320K used,  260712K free                  135572K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1724 root      13   0  1064 1064   816 R     3.4  0.2   0:04 top
 1491 nsamain    0   0  3872 3872  1536 S     0.1  0.7   0:00 nsamain
20984 nsamain    0   0  3872 3872  1536 S     0.0  0.7   0:00 nsamain
20988 nsamain    0   0  3872 3872  1536 S     0.0  0.7   0:00 nsamain
20989 nsamain    0   0  3872 3872  1536 S     0.0  0.7   0:00 nsamain
20990 nsamain    0   0  3872 3872  1536 S     0.0  0.7   0:00 nsamain
20991 nsamain    0   0  3872 3872  1536 S     0.0  0.7   0:00 nsamain
20992 nsamain    0   0  3872 3872  1536 S     0.0  0.7   0:05 nsamain
20997 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:01 nsd8x
21001 nsatgn     6   0  159M 159M  1888 S     0.0 31.5   0:00 nsd8x
21002 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:00 nsd8x
21003 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:00 nsd8x
21004 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:00 nsd8x
21005 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:00 nsd8x
21006 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:03 nsd8x
21013 nsatgn     0   0  159M 159M  1888 S     0.0 31.5   0:00 nsd8x
21021 nsaerc     0   0 24892  24M  1856 S     0.0  4.8   0:01 nsaerc
21025 nsaerc     6   0 24892  24M  1856 S     0.0  4.8   0:00 nsaerc
21026 nsaerc     0   0 24892  24M  1856 S     0.0  4.8   0:00 nsaerc
21027 nsaerc     0   0 24892  24M  1856 S     0.0  4.8   0:00 nsaerc
21028 nsaerc     0   0 24892  24M  1856 S     0.0  4.8   0:00 nsaerc
21030 nsaerc     0   0 24892  24M  1856 S     0.0  4.8   0:00 nsaerc

I bet you guessed: nsatgn is the busy site! I'm assuming that the failure above happened because memory was exhausted.

So, could something in *my* OpenACS code be causing the memory consumption, and if so, what is it or how do I find it?

THANK YOU.
-Bob

Posted by Jerry Asher on
Hi Bob,

There were several memory leaks in AOLserver 3.2 that were fixed, if I recall, for AOLserver 3.3 and 3.4.

However, in my most recent nsvhr/nsunix patches (for AOLserver 3.3ad13), I found one, uh, semi-major AOLserver bug that I just fixed: in the communication driver interface, they let you define a "free" proc, presumably to free memory and structures.  Guess what: they never bother to call it!  Oops.  Nssock and nsssl don't use it, so most folks never notice, but nsunix does.  Over time, that could build up to a pretty big memory leak too.  In creating the new patches, I put several million requests through nsvhr/nssock and nsvhr/nsunix during testing and didn't note any memory leaks.  Those were requests for static pages and not .tcl pages, however, and .tcl pages are where the more typical AOLserver memory leaks have been found.

My site is down at the moment as I recover from what was supposed to be a minor postgres upgrade, but it should be back up sometime tonight or tomorrow, and you should consider moving to AOLserver 3.3ad13.

Posted by Yon Derek on
Could you describe roughly when this "free" proc should be called? I've had the same problem (nsunix eating memory on 3.2+ad12), so I would like to at least patch this setup. I assume that 3.3ad13 without your patches leaks as well?
Posted by Jerry Asher on
I don't know "where" it should be called. I know what I did.

I created a routine, Ns_ConnFree in nsd/conn.c, to call the underlying driver's freeProc (if it exists):

/*
 *-----------------------------------------------------------------
 *
 * Ns_ConnFree - Free a connection.
 *
 * Results:
 *	Always NS_OK.
 * 
 * Side effects:
 *	The underlying driver connection buffers are freed
 *
 *-----------------------------------------------------------------
 */

int
Ns_ConnFree(Ns_Conn *conn)
{
    Conn             *connPtr = (Conn *)conn;
    
    if (connPtr->drvPtr->freeProc != NULL) {
        (*connPtr->drvPtr->freeProc)(connPtr->drvData);
    }
    return NS_OK;
}

I then call this function in two places: first, at the very end of nsd/serv.c/ConnRun (just prior to freeing the DString).
. . .
    Ns_ConnFree(conn);
    Ns_DStringFree(&ds);
}
And also within nsd/drv.c/RunDriver if there is a problem queueing a connection with Ns_QueueConn:
    while ((status = ((*dPtr->acceptProc)(dData, &cData))) == NS_OK) {
        /*
         * Bug fix note: Call with dPtr and not dData.
         * I know this is right for nsunix, and it would be right for
         * nssock and nsssl except that nssock and nsssl do not use THIS
         * procedure. Are there any other modules that need to be looked at?
         */
        if (Ns_QueueConn(dPtr, cData) != NS_OK) {
            /*
             * Bug fix note: now call with cData and not dData!
             * Previously: (*dPtr->closeProc)(dData);
             */
            (*dPtr->closeProc)(cData);
            /*
             * Let the driver free connection structures.
             */
            if (dPtr->freeProc != NULL) {
                (*dPtr->freeProc)(cData);
            }
        }
    }
Of course, my patches incorporate this and lots lots more. Act now! Supplies are limited!
Posted by Don Baccus on
AOL32+ad12 will leak if you're using Tcl 8.3. Jerry's patch will only help if you are using his nsunix etc. stuff. AOL33+ad13 plus Jerry's patches would be best; if you're not using Jerry's stuff, AOL33+ad13 should be sufficient.
Posted by Bob OConnor on

I asked Jerry how to upgrade in this thread on his site:
Upgrade Steps AOL_3.2ad12 to 3.3ad13+patches

Thank you Jerry!

-Bob

Email Ref: http://www.theashergroup.com/bboard/q-and-a-fetch-msg?msg_id=000001

Posted by Jun Yamog on
Hi Bob,

I upgraded my AOLserver 3.2+ad12 because of this leak.  Anyway, if you
are running non-ACS sites, AOLserver 3.3+ad13 has a minor problem.

ns_get_form will not work, since empty_string_p was not included with it.
So just copy the ACS empty_string_p proc into a patch.tcl, place that
patch.tcl in the aolserver/modules/tcl directory, and restart AOLserver.

Jun

Posted by Jerry Asher on
Jun's right about that, but actually, I included that fix in my latest nsvhr/nsunix patch for AOLserver 3.3ad13.

(I tried to keep this latest patch focused on nsvhr/nsunix issues, but the empty_string_p thing kept coming up every time I tried to test the patch against a fresh copy of AOLserver.)