Forum OpenACS Q&A: Outer join Query for Users_files: Help

Hi OpenACS-ers,
I need help with an outer join. I want the home pages module to show a "*" star in front of users who have stored files in the file storage area.

So the USERS table now lists: user_id, last_name etc:

    1 Alpha
    2 Beta
    3 Gamma

Only Beta has uploaded files as stored in USERS_FILES. The users files table also has rows for directories so to find files it appears best to look in the file_size column for rows with greater than zero. There may be multiple files and therefore multiple rows.

Ultimately what I want is a list with

    1 Alpha
    2 Beta *
    3 Gamma

Where the "*" star is an indication that beta has files stored on the system...AND what I need help with is the SQL statement that will get this information. It looks like an outer join problem or a union problem but I am having trouble because I only want to return one row for each user in USERS and there may be ZERO, One or multiple rows in the associated USERS_FILES table.... BTW, the USERS_FILES table uses "owner_id" to match "user_id" in USERS.

Help Please AND Thank you!

-Bob

Collapse
Posted by Todd Gillespie on
I think this way is simplest:

select u.user_id,u.last_name,
  case when exists (select 1 from users_files uf where uf.owner_id = u.user_id) then '*'
   else null
  end as file_owner_flag
from users

If you think about it, I'm not sure that an outer join is appropriate here. You aren't trying to retrieve data in the users_files table, but rather use it to answer a boolean question on rows in users.

HTH

Collapse
Posted by Dan Wickstrom on
In this case, the correlated subquery in the target list is equivalent to an outer join. In fact, I believe that I read on one of the pg lists that they're going to do some optimizations in the planner to convert correlated subqueries to outer joins to improve performance. The equivalent outer join version is as follows:

select u.user_id,
       u.last_name,
       case when uf.owner_id is not null 
               then '*'::text 
               else null::text end as file_owner_flag
from users u 
       LEFT OUTER JOIN 
     users_files uf ON u.user_id = uf.owner_id;


 user_id | last_name | file_owner_flag 
---------+-----------+-----------------
       1 | foo       | 
       2 | bar       | *
       3 | baz       | 
(3 rows)

test=# 

In most cases the outer join version should be faster.

Collapse
Posted by Bob OConnor on

Thanks Todd, that's what I needed:

...
case when exists 
    (select 1 from users_files uf 
      where uf.owner_id = u.user_id 
        and uf.file_size > 0)
         then '*'
         else ' '
         end as file_owner
...

For those lurking, I added the
uf.file_size > 0
so that it would only star for users with files. This code was added to the select in /homepage/all.tcl

And Thanks Dan, I did notice that this is slower to show so it will be great when it's optimized.

-Bob

Collapse
Posted by Dan Wickstrom on
Ok, I didn't read your original post carefully. The query I posted wouldn't work if a user had more than one file in user_files. The correct way to do it with an outer join would be to do the following:

select u.user_id,
       u.last_name,
       uf.file_owner_flag
from users u 
       LEFT OUTER JOIN 
     (select owner_id, '*'::text as file_owner_flag 
        from users_files u
       where file_size > 0 
         and owner_id = u.owner_id) uf ON u.user_id = uf.owner_id;

You were correct about Todd's version being faster. I did some benchmarking on the two versions, and Todd's version was quite a bit faster. This was a misconception on my part, as I've always had the impression that a subselect in the target list would run slowly. Sorry about the misinformation.

Collapse
Posted by Dan Wickstrom on
Grrr... I should quit while I'm ahead, the above query needs a group by to work with multiple entries in users_files:

select u.user_id,
       u.last_name,
       uf.file_owner_flag
from users u 
       LEFT OUTER JOIN 
     (select owner_id, '*'::text as file_owner_flag 
        from users_files u
       where file_size > 0 
         and owner_id = u.owner_id
    group by owner_id, file_owner_flag) uf ON u.user_id = uf.owner_id;

Collapse
Posted by Todd Gillespie on
Dan: the case-with-subselect is faster than the "outer join with projected view with group by" because the 'CASE WHEN EXISTS' need only find the first row, at which point it stops executing. (faster to pronounce, too). The outer join scans the entirety of the relevant subset of the table, groups it, and only then verifies existence. This may already be obvious to you by this point, in which case I'm speaking to the silent audience.

As for stopping while you're ahead... better to stop when you're correct.

Collapse
Posted by Dan Wickstrom on
Oh I understand how the "when exists" clause works, and I understand that using a "group by" statement adds an additional sort stage to a queries execution plan, thus making it slower.  What I was surprised by was the fact that the subselect in the target list was faster than a simple outer-join like the one that I used in my first post.  I've followed several threads on the pg mailing lists where the topic of discussion centered around performance problems with subselects, and I've always assumed that in most cases they are slower than joins.

Things change pretty fast with postgresql, so it's not always safe to make generalized statements with regards to performance.  At least that's what I've learned.