Forum OpenACS Development: Performance problem in PG 8.1.2

Posted by Janine Ohmer on
One of our hosting clients has asked me for help with this problem and I thought I'd toss it out to see if it seemed familiar to anyone.

He's got a test table with 10 million rows in it, containing a single varchar(64) column. Selecting either a single value or a count(*) is very slow, and the query plan shows a sequential scan.
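For reference, the setup looks something like this (a minimal sketch; table and column names here are made up):

    -- Hypothetical reconstruction of the test table.
    CREATE TABLE test_data (
        val varchar(64)
    );

    -- With no index, both of these force a sequential scan
    -- over all 10 million rows:
    EXPLAIN SELECT val FROM test_data WHERE val = 'some-value';
    EXPLAIN SELECT count(*) FROM test_data;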

He tried to build an index, but the resulting load on the server brought his live site to its knees, so he had to kill the process.

So, two questions:

1. Is building an index the right fix here, or will I also/instead need to read up on tuning parameters? I don't usually have to do anything to PG; it Just Works. But we don't generally see tables with that many rows, either.

2. Is there some way to throttle the index building process so it doesn't take up all the system resources? A fallback would be to have him create a new table, put the index on while it's empty, and then programmatically move his data over, putting in delays as necessary to slow it down. But that's pretty kludgy.
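For the record, a rough sketch of that fallback (hypothetical names; the batching and delay logic would have to live in whatever script drives the copy):

    -- Create an empty copy of the table and index it while empty.
    CREATE TABLE test_data_new (LIKE test_data);
    CREATE INDEX test_data_new_val_idx ON test_data_new (val);

    -- Then move the rows over, ideally in batches from application
    -- code, sleeping between batches to keep the load down:
    INSERT INTO test_data_new SELECT val FROM test_data;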

Suggestions? Thanks in advance!

Posted by Tom Jackson on
As a general consideration, indexes are _more important_ if you have a large table. One benefit of using a database is efficient sorting; in fact, no query should ever go live without an 'order by' clause. So, you need indexes.

As for efficiency in building the index: most import scripts disable or drop the index while the data is loaded, then add the index back. So it is more efficient to index after all the data is loaded.
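The usual bulk-load pattern looks something like this (hypothetical names and path; PostgreSQL has no real "disable index", so in practice you drop and recreate it):

    DROP INDEX test_data_val_idx;
    COPY test_data FROM '/path/to/data.txt';  -- load with no index to maintain
    CREATE INDEX test_data_val_idx ON test_data (val);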

As a positive note: if you can bring your live site to its knees with this type of operation, that is a good thing. Most sites suffer from a lack of success. Oh, and obviously, Oracle would breeze through this process with no problem.

Posted by Malte Sussdorff on
If he has the chance, he should upgrade to the 8.2.x version of PostgreSQL, which got considerable performance enhancements, though I am not sure if they are in this area as well.
Posted by Janine Ohmer on
For completeness' sake, I was wrong about the version. I took his word for it when he said 8.1.2 but it's actually 8.2.3. I was hoping that would be an easy fix, but no joy.
Posted by Tom Jackson on
One problem with giving any advice is not knowing the type of data or the kind of search you need to do. It isn't surprising that MySQL would do a simple count or even a search faster than PG.

If you are doing a unique key search, then the count is either 0 or 1. If you are doing a text match, then a unique key index isn't going to help much unless it is an 'abc%' type of match. With the small size of 64 chars, tsearch might be a very fast index and even pretty efficient on storage.
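To illustrate the 'abc%' case (a sketch with hypothetical names; the varchar_pattern_ops operator class is only needed when the database locale is not C):

    CREATE INDEX test_data_val_idx ON test_data (val varchar_pattern_ops);

    -- An anchored prefix match can use the b-tree index:
    SELECT count(*) FROM test_data WHERE val LIKE 'abc%';

    -- An unanchored match like '%abc%' cannot, and still seq-scans.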

Posted by Gustaf Neumann on
To add to Tom's statement: the data types are quite important. If you have numbers to index on, don't use "numeric" as the data type unless you really have to. We found out in one of our applications that a search over integer is much more efficient.
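As a quick illustration of the point (hypothetical tables, not the ones from our application):

    CREATE TABLE amounts_fast (amount integer);  -- fixed-size, cheap to compare and index
    CREATE TABLE amounts_slow (amount numeric);  -- variable-length, noticeably slower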

Be aware as well that PostgreSQL supports different types of indexes. I have seen a couple of benchmarks in German IT newspapers comparing MySQL and PostgreSQL, showing that for large data sets Postgres is actually superior.
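For example (a sketch, with hypothetical names):

    CREATE INDEX idx_btree ON test_data USING btree (val);  -- the default
    CREATE INDEX idx_hash  ON test_data USING hash (val);   -- equality lookups only
    -- GiST (and, since 8.2, GIN) indexes back full-text search via tsearch2.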

Posted by Tom Jackson on
Another point about using tsearch: you should be able to make a copy of the table, add the tsearch column, and create the index anywhere. Since the indexed tsearch column is part of the row data, this is possible and easy without any hack involved. Then you could copy the modified data back into the original table. You may have to go offline for a brief time, but you should be able to do all this testing without interfering with the live site.
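A sketch of that procedure, assuming the tsearch2 contrib module is installed (hypothetical names):

    -- Work on a copy so the live table is untouched.
    CREATE TABLE test_data_copy AS SELECT val FROM test_data;
    ALTER TABLE test_data_copy ADD COLUMN val_fti tsvector;
    UPDATE test_data_copy SET val_fti = to_tsvector(val);
    CREATE INDEX test_data_copy_fti_idx
        ON test_data_copy USING gist (val_fti);

    -- Search the copy with the @@ operator:
    SELECT count(*) FROM test_data_copy
     WHERE val_fti @@ to_tsquery('abc');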
Posted by Dave Bauer on
You don't say if you did this already, so I would recommend posting on the PostgreSQL mailing lists. I think they have a performance or SQL list where they focus on this type of issue.

Count(*) is ALWAYS slow on PostgreSQL, so the index won't fix that. The index should fix the scan for a single value.

I can think of a few parameters you might want to look at, which helped us tune a pretty large PostgreSQL install.

shared_buffers was increased to 120000 on a machine with 8 GB of RAM, I think. Increasing it further did not help much; I tested this experimentally.

max_fsm_pages set to 200000
work_mem increased to 2048
effective_cache_size set to 32768

This one is pretty important:
default_statistics_target set to 100
(not sure what the default is; 10, I think)
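Collected as a postgresql.conf excerpt, with the values as reported above (tune to your own hardware, of course):

    shared_buffers = 120000            # on a machine with ~8 GB of RAM
    max_fsm_pages = 200000
    work_mem = 2048
    effective_cache_size = 32768
    default_statistics_target = 100    # the default is much lower (10, I think)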

These all helped; we got some slow queries to switch from sequential scans to index scans with these settings.

The big thing this sped up was full text search with tsearch2. (That is why shared_buffers is so big.)

I also noticed that in quite a few places we had to rewrite queries that ran fine on smaller databases or on PG 7.x but ran very slowly on PG 8.x with huge tables. This is mainly queries on views; joins on multiple views seem to be very slow. That is not an issue with your example, of course.

To summarize: you can't avoid a sequential scan on that table to select a single row without an index. Unfortunately I don't know how to speed up index creation, but I hope some of this information will be useful. Definitely ask about the index creation on one of the PG mailing lists.

Posted by Janine Ohmer on
Thanks for the info, guys.

I forgot to mention that he loaded the same data into MySQL and had much better performance. I don't think he had an index on that table either. Ouch!