We've just run into a situation where some bad HTML took our server down. The bad HTML (unclosed tags) caused the text/html conversion proc called in acs_mail_lite::send_immediately to return an error. Although the error is caught in the sweeper (acs-mail-lite-procs.tcl), the memory consumed in the scheduled thread is not released. Since the message wasn't sent, the sweeper tries to process it again and again. In our case, the outgoing message contained batched notifications for many edits on a handful of large wiki pages, which totaled about 8 MB. With the sweeper running every minute and each failed attempt leaving its memory unreleased, you can see how this can get out of hand quickly.
We've identified the source of the bad HTML and, locally, we've modified the sweeper to leave a bad message locked so it won't be processed again. This isn't a proper long-term solution, though. Rather than silently deleting the message or abusing the semantics of the locking_server column, I suggest we generalize the "locking_server" column into a status column that can be 'locked' or 'failed'. Thoughts?
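To make the proposal concrete, here is a rough sketch of the behaviour I have in mind. It is written in Python against an in-memory SQLite table purely for illustration, not as a patch to acs-mail-lite. The table name mail_queue, its columns, the use of NULL for "not yet processed", and the send callback are all assumptions made up for this example.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical queue table: status is NULL until the sweeper touches the row.
    conn.execute(
        "CREATE TABLE mail_queue ("
        " message_id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " status TEXT CHECK (status IN ('locked', 'failed')))"
    )


def sweep(conn, send):
    """Try each unprocessed message once; mark failures so they are skipped later."""
    rows = conn.execute(
        "SELECT message_id, body FROM mail_queue"
        " WHERE status IS NULL ORDER BY message_id"
    ).fetchall()
    for message_id, body in rows:
        # Claim the message before attempting to send it.
        conn.execute(
            "UPDATE mail_queue SET status = 'locked' WHERE message_id = ?",
            (message_id,),
        )
        try:
            send(body)
        except Exception:
            # Keep the message for inspection, but never pick it up again.
            conn.execute(
                "UPDATE mail_queue SET status = 'failed' WHERE message_id = ?",
                (message_id,),
            )
        else:
            # Sent successfully, so the row can go.
            conn.execute(
                "DELETE FROM mail_queue WHERE message_id = ?", (message_id,)
            )
    conn.commit()
```

The point of the sketch is that a failed message stays in the table with an explicit 'failed' status, so an admin can find it, while later sweeps only select rows whose status is still NULL and therefore never retry it.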