Forum OpenACS Development: performance hint for busy sites with linux

Dear all,

maybe somebody finds this interesting and helpful:

In March we spent quite a long time trying to understand a set of strange performance problems under Linux, where e.g. password verification against an external server took 20 seconds or more (password server idle, no unusual network traffic), or where simple db operations in PostgreSQL sometimes suddenly took 10 seconds, or 1 minute, etc. We tried to address the PostgreSQL problem with various tuning options for checkpoint writing, with no apparent success.

The real cause is actually well known as the "ext3 latency problem" and became famous as the Firefox system-freeze problem under Linux, where the fix lies outside of Firefox. In short, the problem is caused by the "ordered writes" in fsync operations (data=ordered), which means that the writing of metadata is delayed until all writing of content data has finished. If a lot of data is being written, the file system may block every request during this period, which can last a long time (even minutes), so the whole system appears to freeze. The problem with authentication actually came from writing to the log file, which was blocked as well.
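One rough way to observe such a stall is to time a tiny synchronous write while the system is busy. The following is only a sketch, not something we used at the time: the function name probe_fsync is made up for illustration, and it assumes a dd that supports conv=fsync (as GNU dd does) and a date that supports +%s. The resolution is whole seconds, which is enough for stalls of the size described above.

```shell
# Write one 4 KB block into the given directory, force it to disk with
# fsync, and print how many whole seconds that took.
# probe_fsync is a hypothetical helper name, not an existing tool.
probe_fsync() {
    probe_dir="${1:-.}"
    probe_file="$probe_dir/.fsync_probe.$$"
    probe_start=$(date +%s)
    # conv=fsync makes dd call fsync on the output file before exiting
    if ! dd if=/dev/zero of="$probe_file" bs=4k count=1 conv=fsync 2>/dev/null; then
        rm -f "$probe_file"
        return 1
    fi
    probe_end=$(date +%s)
    rm -f "$probe_file"
    echo $((probe_end - probe_start))
}

# Usage: run on the file system that holds the logs or the database, e.g.
#   probe_fsync /var/log
```

On an idle system this should print 0; values of several seconds while other processes write heavily to the same file system would be consistent with the behavior described above.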

There is a long discussion about the potential data-loss implications of changing data=ordered to data=writeback (which makes ext3 more similar to ext2). Read the discussions linked below and form your own opinion. Linus decided to change the default mount option for ext3 to data=writeback in newer Linux versions (2.6.30+).

If you have a new kernel and you use no special mount options on your site, you are using this already. If you have an older version of the Linux kernel, you might consider altering the journaling mode with e.g.

sudo tune2fs -o journal_data_writeback /dev/hda1

Consider this only if you have a busy site where large amounts of data are written. For our production site this change made a big difference.
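Before changing anything, it can help to check which journaling mode is currently in effect. The sketch below assumes the data= option shows up in the options column of /proc/mounts, which is not guaranteed on every kernel (a default mode may simply not be listed); the function name show_journal_mode is made up for illustration.

```shell
# Read lines in /proc/mounts format from stdin and print
# "<mount point> <fs type> <data mode>" for every ext3/ext4 entry.
# show_journal_mode is a hypothetical helper name.
show_journal_mode() {
    awk '$3 == "ext3" || $3 == "ext4" {
        mode = "(not listed)"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^data=/) mode = substr(opts[i], 6)
        print $2, $3, mode
    }'
}

# Usage on a live system:
#   show_journal_mode < /proc/mounts
# The defaults stored in the superblock can be inspected with:
#   sudo tune2fs -l /dev/hda1 | grep -i 'default mount options'
```

A line such as "/ ext3 ordered" means the root file system is still mounted with data=ordered.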

http://lwn.net/Articles/328363/
http://article.gmane.org/gmane.linux.kernel/818261

Best regards

-gustaf neumann

Posted by Eduardo Santos on
Hi Gustaf,

Thank you very much for sharing this with us. I would never have thought of such a thing.

I guess a good way to find out if this is happening on our system is to see if there are some queries taking longer than they should without any clear reason? Or maybe some other I/O operations being too slow?

Posted by Gustaf Neumann on
A good start is to grep for "longdb" in your error log and check for queries that are fast under normal conditions.
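As a minimal sketch of that search: the helper names below are made up for illustration, and the path to the error log depends on your installation, so it is passed as an argument rather than assumed.

```shell
# Count the lines that mention "longdb" in the given log file.
# count_longdb and recent_longdb are hypothetical helper names.
count_longdb() {
    grep -c 'longdb' "$1"
}

# Show the most recent "longdb" lines (default: the last 20).
recent_longdb() {
    grep 'longdb' "$1" | tail -n "${2:-20}"
}

# Usage:
#   count_longdb path/to/error.log
#   recent_longdb path/to/error.log 50
```

If the reported queries are ones that normally finish quickly, and they cluster at times of heavy write activity, the file system is a plausible suspect.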

Another option in general is to switch to another file system, especially with newer kernels. Here are some file systems tested with PostgreSQL 9 on some recent Linux kernels, which already use data=writeback by default:
http://www.phoronix.com/scan.php?page=article&item=linux_2638_large&num=2

One can see that neither ext3 nor ext4 is among the fastest for TPC-B with PostgreSQL. Still, I would not recommend that everybody switch blindly. Sites with low db activity and traffic won't see any difference.