test-doc - Weblog Page

Filtered by category beginner, 21 - 30 of 42 Postings (all, summary)

Trees

Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 04:29 PM

On its face, the relational database management system would appear to be a very poor tool for representing and manipulating trees. This chapter is designed to accomplish the following things:

show you that a row in an SQL database can be thought of as an object
show you that a pointer from one object to another can be represented by storing an integer key in a regular database column
demonstrate the Oracle tree extensions (CONNECT BY ... PRIOR)
show you how to work around the limitations of CONNECT BY with PL/SQL

The canonical example of trees in Oracle is the org chart.


create table corporate_slaves (
       slave_id		      integer primary key,
       supervisor_id	      references corporate_slaves,
       name		      varchar(100)
);

insert into corporate_slaves values (1, NULL, 'Big Boss Man');
insert into corporate_slaves values (2, 1, 'VP Marketing');
insert into corporate_slaves values (3, 1, 'VP Sales');
insert into corporate_slaves values (4, 3, 'Joe Sales Guy');
insert into corporate_slaves values (5, 4, 'Bill Sales Assistant');
insert into corporate_slaves values (6, 1, 'VP Engineering');
insert into corporate_slaves values (7, 6, 'Jane Nerd');
insert into corporate_slaves values (8, 6, 'Bob Nerd');

SQL> column name format a20
SQL> select * from corporate_slaves;

  SLAVE_ID SUPERVISOR_ID NAME
---------- ------------- --------------------
	 1		 Big Boss Man
	 2	       1 VP Marketing
	 3	       1 VP Sales
	 4	       3 Joe Sales Guy
	 6	       1 VP Engineering
	 7	       6 Jane Nerd
	 8	       6 Bob Nerd
	 5	       4 Bill Sales Assistant

8 rows selected.

The integers in the supervisor_id are actually pointers to other rows in the corporate_slaves table. Need to display an org chart? With only standard SQL available, you'd write a program in the client language (e.g., C, Lisp, Perl, or Tcl) to do the following:

query Oracle to find the employee where supervisor_id is null, call this $big_kahuna_id
query Oracle to find those employees whose supervisor_id = $big_kahuna_id
for each subordinate, query Oracle again to find their subordinates.
repeat until no subordinates found, then back up one level

With the Oracle CONNECT BY clause, you can get all the rows out at once:


select name, slave_id, supervisor_id
from corporate_slaves
connect by prior slave_id = supervisor_id;

NAME		       SLAVE_ID SUPERVISOR_ID
-------------------- ---------- -------------
Big Boss Man		      1
VP Marketing		      2 	    1
VP Sales		      3 	    1
Joe Sales Guy		      4 	    3
Bill Sales Assistant	      5 	    4
VP Engineering		      6 	    1
Jane Nerd		      7 	    6
Bob Nerd		      8 	    6

VP Marketing		      2 	    1

VP Sales		      3 	    1
Joe Sales Guy		      4 	    3
Bill Sales Assistant	      5 	    4

Joe Sales Guy		      4 	    3
Bill Sales Assistant	      5 	    4

VP Engineering		      6 	    1
Jane Nerd		      7 	    6
Bob Nerd		      8 	    6

Jane Nerd		      7 	    6

Bob Nerd		      8 	    6

Bill Sales Assistant	      5 	    4

20 rows selected.

This seems a little strange. It looks as though Oracle has produced all possible trees and subtrees. Let's add a START WITH clause:


select name, slave_id, supervisor_id
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id in (select slave_id 
                        from corporate_slaves 
                        where supervisor_id is null);

NAME		       SLAVE_ID SUPERVISOR_ID
-------------------- ---------- -------------
Big Boss Man		      1
VP Marketing		      2 	    1
VP Sales		      3 	    1
Joe Sales Guy		      4 	    3
Bill Sales Assistant	      5 	    4
VP Engineering		      6 	    1
Jane Nerd		      7 	    6
Bob Nerd		      8 	    6

8 rows selected.

Notice that we've used a subquery in the START WITH clause to find out who is/are the big kahuna(s). For the rest of this example, we'll just hard-code in the slave_id 1 for brevity.

Though these folks are in the correct order, it is kind of tough to tell from the preceding report who works for whom. Oracle provides a magic pseudo-column that is meaningful only when a query includes a CONNECT BY. The pseudo-column is level:


select name, slave_id, supervisor_id, level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1;

NAME		       SLAVE_ID SUPERVISOR_ID	   LEVEL
-------------------- ---------- ------------- ----------
Big Boss Man		      1 		       1
VP Marketing		      2 	    1	       2
VP Sales		      3 	    1	       2
Joe Sales Guy		      4 	    3	       3
Bill Sales Assistant	      5 	    4	       4
VP Engineering		      6 	    1	       2
Jane Nerd		      7 	    6	       3
Bob Nerd		      8 	    6	       3

8 rows selected.

The level column can be used for indentation. Here we will use the concatenation operator (||) to add spaces in front of the name column:


column padded_name format a30

select 
  lpad(' ', (level - 1) * 2) || name as padded_name, 
  slave_id, 
  supervisor_id, 
  level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1;

PADDED_NAME			 SLAVE_ID SUPERVISOR_ID      LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man				1			 1
  VP Marketing				2	      1 	 2
  VP Sales				3	      1 	 2
    Joe Sales Guy			4	      3 	 3
      Bill Sales Assistant		5	      4 	 4
  VP Engineering			6	      1 	 2
    Jane Nerd				7	      6 	 3
    Bob Nerd				8	      6 	 3

8 rows selected.

If you want to limit your report, you can use standard WHERE clauses:


select 
  lpad(' ', (level - 1) * 2) || name as padded_name, 
  slave_id, 
  supervisor_id, 
  level
from corporate_slaves
where level <= 3
connect by prior slave_id = supervisor_id
start with slave_id = 1;

PADDED_NAME			 SLAVE_ID SUPERVISOR_ID      LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man				1			 1
  VP Marketing				2	      1 	 2
  VP Sales				3	      1 	 2
    Joe Sales Guy			4	      3 	 3
  VP Engineering			6	      1 	 2
    Jane Nerd				7	      6 	 3
    Bob Nerd				8	      6 	 3

7 rows selected.

Suppose that you want people at the same level to sort alphabetically. Sadly, the ORDER BY clause doesn't work so great in conjunction with CONNECT BY:


select 
  lpad(' ', (level - 1) * 2) || name as padded_name, 
  slave_id, 
  supervisor_id, 
  level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1
order by level, name;

PADDED_NAME			 SLAVE_ID SUPERVISOR_ID      LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man				1			 1
  VP Engineering			6	      1 	 2
  VP Marketing				2	      1 	 2
  VP Sales				3	      1 	 2
    Bob Nerd				8	      6 	 3
    Jane Nerd				7	      6 	 3
    Joe Sales Guy			4	      3 	 3
      Bill Sales Assistant		5	      4 	 4

select 
  lpad(' ', (level - 1) * 2) || name as padded_name, 
  slave_id, 
  supervisor_id, 
  level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1
order by name;

PADDED_NAME			 SLAVE_ID SUPERVISOR_ID      LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man				1			 1
      Bill Sales Assistant		5	      4 	 4
    Bob Nerd				8	      6 	 3
    Jane Nerd				7	      6 	 3
    Joe Sales Guy			4	      3 	 3
  VP Engineering			6	      1 	 2
  VP Marketing				2	      1 	 2
  VP Sales				3	      1 	 2

SQL is a set-oriented language. In the result of a CONNECT BY query, it is precisely the order that has value. Thus it doesn't make much sense to also have an ORDER BY clause.

JOIN doesn't work with CONNECT BY

If we try to build a report showing each employee and his or her supervisor's name, we are treated to one of Oracle's few informative error messages:


select 
  lpad(' ', (level - 1) * 2) || cs1.name as padded_name, 
  cs2.name as supervisor_name 
from corporate_slaves cs1, corporate_slaves cs2
where cs1.supervisor_id = cs2.slave_id(+)
connect by prior cs1.slave_id = cs1.supervisor_id
start with cs1.slave_id = 1;

ERROR at line 4:
ORA-01437: cannot have join with CONNECT BY

We can work around this particular problem by creating a view:


create or replace view connected_slaves 
as
select 
  lpad(' ', (level - 1) * 2) || name as padded_name, 
  slave_id, 
  supervisor_id, 
  level as the_level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1;

Notice that we've had to rename level so that we didn't end up with a view column named after a reserved word. The view works just like the raw query:


select * from connected_slaves;

PADDED_NAME			 SLAVE_ID SUPERVISOR_ID  THE_LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man				1			 1
  VP Marketing				2	      1 	 2
  VP Sales				3	      1 	 2
    Joe Sales Guy			4	      3 	 3
      Bill Sales Assistant		5	      4 	 4
  VP Engineering			6	      1 	 2
    Jane Nerd				7	      6 	 3
    Bob Nerd				8	      6 	 3

8 rows selected.

but we can JOIN now


select padded_name, corporate_slaves.name as supervisor_name
from connected_slaves, corporate_slaves
where connected_slaves.supervisor_id = corporate_slaves.slave_id(+);

PADDED_NAME		       SUPERVISOR_NAME
------------------------------ --------------------
Big Boss Man
  VP Marketing		       Big Boss Man
  VP Sales		       Big Boss Man
    Joe Sales Guy	       VP Sales
      Bill Sales Assistant     Joe Sales Guy
  VP Engineering	       Big Boss Man
    Jane Nerd		       VP Engineering
    Bob Nerd		       VP Engineering

8 rows selected.

If you have sharp eyes, you'll notice that we've actually OUTER JOINed so that our results don't exclude the big boss.

Select-list subqueries do work with CONNECT BY

Instead of the VIEW and JOIN, we could have added a subquery to the select list:


select 
  lpad(' ', (level - 1) * 2) || name as padded_name, 
  (select name 
   from corporate_slaves cs2
   where cs2.slave_id = cs1.supervisor_id) as supervisor_name
from corporate_slaves cs1
connect by prior slave_id = supervisor_id
start with slave_id = 1;

PADDED_NAME		       SUPERVISOR_NAME
------------------------------ --------------------
Big Boss Man
  VP Marketing		       Big Boss Man
  VP Sales		       Big Boss Man
    Joe Sales Guy	       VP Sales
      Bill Sales Assistant     Joe Sales Guy
  VP Engineering	       Big Boss Man
    Jane Nerd		       VP Engineering
    Bob Nerd		       VP Engineering

8 rows selected.

The general rule in Oracle is that you can have a subquery that returns a single row anywhere in the select list.

Does this person work for me?

Suppose that you've built an intranet Web service. There are things that your software should show to an employee's boss (or boss's boss) that it shouldn't show to a subordinate or peer. Here we try to figure out if the VP Marketing (#2) has supervisory authority over Jane Nerd (#7):


select count(*) 
from corporate_slaves
where slave_id = 7
and level > 1
start with slave_id = 2
connect by prior slave_id = supervisor_id;

  COUNT(*)
----------
	 0

Apparently not. Notice that we start with the VP Marketing (#2) and stipulate level > 1 to be sure that we will never conclude that someone supervises him or herself. Let's ask if the Big Boss Man (#1) has authority over Jane Nerd:


select count(*) 
from corporate_slaves
where slave_id = 7
and level > 1
start with slave_id = 1
connect by prior slave_id = supervisor_id;

  COUNT(*)
----------
	 1

Even though Big Boss Man isn't Jane Nerd's direct supervisor, asking Oracle to compute the relevant subtree yields us the correct result. In the ArsDigita Community System Intranet module, we decided that this computation was too important to be left as a query in individual Web pages. We centralized the question in a PL/SQL procedure:


create or replace function intranet_supervises_p
  (query_supervisor IN integer, query_user_id IN integer)
return varchar
is
  n_rows_found integer;
BEGIN
  select count(*) into n_rows_found
   from intranet_users
   where user_id = query_user_id
   and level > 1
   start with user_id = query_supervisor
   connect by supervisor = PRIOR user_id;
  if n_rows_found > 0 then 
	return 't';
  else 
	return 'f';  
  end if;
END intranet_supervises_p;

Family trees

What if the graph is a little more complicated than employee-supervisor? For example, suppose that you are representing a family tree. Even without allowing for divorce and remarriage, exotic South African fertility clinics, etc., we still need more than one pointer for each node:


create table family_relatives (
	relative_id	integer primary key,
	spouse		references family_relatives,
	mother		references family_relatives,
	father		references family_relatives,
	-- in case they don't know the exact birthdate
	birthyear	integer,
	birthday	date,
	-- sadly, not everyone is still with us
	deathyear	integer,
	first_names	varchar(100) not null,
	last_name	varchar(100) not null,
	sex		char(1) check (sex in ('m','f')),
	-- note the use of multi-column check constraints
	check ( birthyear is not null or birthday is not null)
);

-- some test data 


insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(1, 'Nick', 'Gittes', 'm', NULL, NULL, NULL, 1902);

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(2, 'Cecile', 'Kaplan', 'f', 1, NULL, NULL, 1910);

update family_relatives 
set spouse = 2 
where relative_id = 1;

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(3, 'Regina', 'Gittes', 'f', NULL, 2, 1, 1934);

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(4, 'Marjorie', 'Gittes', 'f', NULL, 2, 1, 1936);

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(5, 'Shirley', 'Greenspun', 'f', NULL, NULL, NULL, 1901);

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(6, 'Jack', 'Greenspun', 'm', 5, NULL, NULL, 1900);

update family_relatives 
set spouse = 6
where relative_id = 5;

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(7, 'Nathaniel', 'Greenspun', 'm', 3, 5, 6, 1930);

update family_relatives 
set spouse = 7
where relative_id = 3;

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(8, 'Suzanne', 'Greenspun', 'f', NULL, 3, 7, 1961);

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(9, 'Philip', 'Greenspun', 'm', NULL, 3, 7, 1963);

insert into family_relatives 
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(10, 'Harry', 'Greenspun', 'm', NULL, 3, 7, 1965);

In applying the lessons from the employee examples, the most obvious problem that we face now is whether to follow the mother or the father pointers:


column full_name format a25

-- follow patrilineal (start with my mom's father)
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives 
connect by prior relative_id = father
start with relative_id = 1;

FULL_NAME
-------------------------
Nick Gittes
  Regina Gittes
  Marjorie Gittes

-- follow matrilineal (start with my mom's mother)
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives 
connect by prior relative_id = mother
start with relative_id = 2;

FULL_NAME
-------------------------
Cecile Kaplan
  Regina Gittes
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
  Marjorie Gittes

Here's what the official Oracle docs have to say about CONNECT BY:

specifies the relationship between parent rows and child rows of the hierarchy. condition can be any condition as described in "Conditions". However, some part of the condition must use the PRIOR operator to refer to the parent row. The part of the condition containing the PRIOR operator must have one of the following forms:
PRIOR expr comparison_operator expr 
expr comparison_operator PRIOR expr 

There is nothing that says comparison_operator has to be merely the equals sign. Let's start again with my mom's father but CONNECT BY more than one column:


-- follow both
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives 
connect by prior relative_id in (mother, father)
start with relative_id = 1;

FULL_NAME
-------------------------
Nick Gittes
  Regina Gittes
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
  Marjorie Gittes

Instead of arbitrarily starting with Grandpa Nick, let's ask Oracle to show us all the trees that start with a person whose parents are unknown:


select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives 
connect by prior relative_id in (mother, father)
start with relative_id in (select relative_id from family_relatives
                           where mother is null
                           and father is null);

FULL_NAME
-------------------------
Nick Gittes
  Regina Gittes
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
  Marjorie Gittes
Cecile Kaplan
  Regina Gittes
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
  Marjorie Gittes
Shirley Greenspun
  Nathaniel Greenspun
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
Jack Greenspun
  Nathaniel Greenspun
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun

22 rows selected.

PL/SQL instead of JOIN

The preceding report is interesting but confusing because it is hard to tell where the trees meet in marriage. As noted above, you can't do a JOIN with a CONNECT BY. We demonstrated the workaround of burying the CONNECT BY in a view. A more general workaround is using PL/SQL:


create or replace function family_spouse_name 
  (v_relative_id family_relatives.relative_id%TYPE)
return varchar
is
  v_spouse_id integer;
  spouse_name varchar(500);
BEGIN
  select spouse into v_spouse_id 
    from family_relatives
    where relative_id = v_relative_id;
  if v_spouse_id is null then
    return null;
  else 
    select (first_names || ' ' || last_name) into spouse_name
      from family_relatives
      where relative_id = v_spouse_id;
    return spouse_name;
  end if;
END family_spouse_name;
/
show errors

column spouse format a20

select 
  lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
  family_spouse_name(relative_id) as spouse
from family_relatives 
connect by prior relative_id in (mother, father)
start with relative_id in (select relative_id from family_relatives
                           where mother is null
                           and father is null);

FULL_NAME		  SPOUSE
------------------------- --------------------
Nick Gittes		  Cecile Kaplan
  Regina Gittes 	  Nathaniel Greenspun
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
  Marjorie Gittes
Cecile Kaplan		  Nick Gittes
  Regina Gittes 	  Nathaniel Greenspun
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
  Marjorie Gittes
Shirley Greenspun	  Jack Greenspun
  Nathaniel Greenspun	  Regina Gittes
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun
Jack Greenspun		  Shirley Greenspun
  Nathaniel Greenspun	  Regina Gittes
    Suzanne Greenspun
    Philip Greenspun
    Harry Greenspun

PL/SQL instead of JOIN and GROUP BY

Suppose that in addition to displaying the family tree in a Web page, we also want to show a flag when a story about a family member is available. First we need a way to represent stories:


create table family_stories (
	family_story_id		integer primary key,
	story			clob not null,
	item_date		date,
	item_year		integer,
	access_control		varchar(20)
             check (access_control in ('public', 'family', 'designated')),
	check (item_date is not null or item_year is not null)
);

-- a story might be about more than one person
create table family_story_relative_map (
	family_story_id		references family_stories,
	relative_id		references family_relatives,
	primary key (relative_id, family_story_id)
);

-- put in a test story
insert into family_stories
(family_story_id, story, item_year, access_control)
values
(1, 'After we were born, our parents stuck the Wedgwood in a cabinet 
and bought indestructible china.  Philip and his father were sitting at
the breakfast table one morning.  Suzanne came downstairs and, without
saying a word, took a cereal bowl from the cupboard, walked over to 
Philip and broke the bowl over his head.  Their father immediately
started laughing hysterically.', 1971, 'public');

insert into family_story_relative_map 
(family_story_id, relative_id)
values
(1, 8);

insert into family_story_relative_map 
(family_story_id, relative_id)
values
(1, 9);

insert into family_story_relative_map 
(family_story_id, relative_id)
values
(1, 7);

To show the number of stories alongside a family member's listing, we would typically do an OUTER JOIN and then GROUP BY the columns other than the count(family_story_id). In order not to disturb the CONNECT BY, however, we create another PL/SQL function:


create or replace function family_n_stories (v_relative_id family_relatives.relative_id%TYPE)
return integer
is
  n_stories integer;
BEGIN
  select count(*) into n_stories 
    from family_story_relative_map 
    where relative_id = v_relative_id;
  return n_stories;
END family_n_stories;
/
show errors

select 
  lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name, 
  family_n_stories(relative_id) as n_stories
from family_relatives 
connect by prior relative_id in (mother, father)
start with relative_id in (select relative_id from family_relatives
                           where mother is null
                           and father is null);

FULL_NAME		   N_STORIES
------------------------- ----------
Nick Gittes			   0
...
Shirley Greenspun		   0
  Nathaniel Greenspun		   1
    Suzanne Greenspun		   1
    Philip Greenspun		   1
    Harry Greenspun		   0
...

Working Backwards

What does it look like to start at the youngest generation and work back?


select 
  lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
  family_spouse_name(relative_id) as spouse
from family_relatives 
connect by relative_id in (prior mother, prior father)
start with relative_id = 9;

FULL_NAME		  SPOUSE
------------------------- --------------------
Philip Greenspun
  Regina Gittes 	  Nathaniel Greenspun
    Nick Gittes 	  Cecile Kaplan
    Cecile Kaplan	  Nick Gittes
  Nathaniel Greenspun	  Regina Gittes
    Shirley Greenspun	  Jack Greenspun
    Jack Greenspun	  Shirley Greenspun

We ought to be able to view all the trees starting from all the leaves but Oracle seems to be exhibiting strange behavior:


select 
  lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
  family_spouse_name(relative_id) as spouse
from family_relatives 
connect by relative_id in (prior mother, prior father)
start with relative_id not in (select mother from family_relatives 
                               union 
                               select father from family_relatives);

no rows selected

What's wrong? If we try the subquery by itself, we get a reasonable result. Here are all the relative_ids that appear in the mother or father column at least once.


select mother from family_relatives 
union 
select father from family_relatives

    MOTHER
----------
	 1
	 2
	 3
	 5
	 6
	 7


7 rows selected.

The answer lies in that extra blank line at the bottom. There is a NULL in this result set. Experimentation reveals that Oracle behaves asymmetrically with NULLs and IN and NOT IN:


SQL> select * from dual where 1 in (1,2,3,NULL);

D
-
X

SQL> select * from dual where 1 not in (2,3,NULL);

no rows selected

The answer is buried in the Oracle documentation of NOT IN: "Evaluates to FALSE if any member of the set is NULL." The correct query in this case?


select 
  lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
  family_spouse_name(relative_id) as spouse
from family_relatives 
connect by relative_id in (prior mother, prior father)
start with relative_id not in (select mother 
                               from family_relatives 
                               where mother is not null
                               union 
                               select father
                               from family_relatives 
                               where father is not null);

FULL_NAME		  SPOUSE
------------------------- --------------------
Marjorie Gittes
  Nick Gittes		  Cecile Kaplan
  Cecile Kaplan 	  Nick Gittes
Suzanne Greenspun
  Regina Gittes 	  Nathaniel Greenspun
    Nick Gittes 	  Cecile Kaplan
    Cecile Kaplan	  Nick Gittes
  Nathaniel Greenspun	  Regina Gittes
    Shirley Greenspun	  Jack Greenspun
    Jack Greenspun	  Shirley Greenspun
Philip Greenspun
  Regina Gittes 	  Nathaniel Greenspun
    Nick Gittes 	  Cecile Kaplan
    Cecile Kaplan	  Nick Gittes
  Nathaniel Greenspun	  Regina Gittes
    Shirley Greenspun	  Jack Greenspun
    Jack Greenspun	  Shirley Greenspun
Harry Greenspun
  Regina Gittes 	  Nathaniel Greenspun
    Nick Gittes 	  Cecile Kaplan
    Cecile Kaplan	  Nick Gittes
  Nathaniel Greenspun	  Regina Gittes
    Shirley Greenspun	  Jack Greenspun
    Jack Greenspun	  Shirley Greenspun

24 rows selected.

Performance and Tuning

Oracle is not getting any help from the Tree Fairy in producing results from a CONNECT BY. If you don't want tree queries to take O(N^2) time, you need to build indices that let Oracle very quickly answer questions of the form "What are all the children of Parent X?"

For the corporate slaves table, you'd want two concatenated indices:


create index corporate_slaves_idx1 
  on corporate_slaves (slave_id, supervisor_id);
create index corporate_slaves_idx2 
  on corporate_slaves (supervisor_id, slave_id);

Reference

---

based on SQL for Web Nerds

Categories: beginner (Audience)

More complex queries

Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 02:28 PM

Suppose that you want to start lumping together information from multiple rows. For example, you're interested in JOINing users with their classified ads. That will give you one row per ad posted. But you want to mush all the rows together for a particular user and just look at the most recent posting time. What you need is the GROUP BY construct:


select users.user_id, users.email, max(classified_ads.posted)
from users, classified_ads
where users.user_id = classified_ads.user_id
group by users.user_id, users.email
order by upper(users.email);

   USER_ID EMAIL			       MAX(CLASSI
---------- ----------------------------------- ----------
     39406 102140.1200@compuserve.com	       1998-10-08
     39842 102144.2651@compuserve.com	       1998-12-13
     41426 50@seattle.va.gov		       1997-01-13
     37428 71730.345@compuserve.com	       1998-11-24
     35970 aaibrahim@earthlink.net	       1998-11-08
     36671 absolutsci@aol.com		       1998-10-06
     35781 alevy@agtnet.com		       1997-07-14
     40111 alexzorba@aol.com		       1998-09-25
     39060 amchiu@worldnet.att.net	       1998-12-11
     35989 andrew.c.beckner@bankamerica.com    1998-08-13
     33923 andy_roo@mit.edu		       1998-12-10

The group by users.user_id, users.email tells SQL to "lump together all the rows that have the same values in these two columns." In addition to the grouped by columns, we can run aggregate functions on the columns that aren't being grouped. For example, the MAX above applies to the posting dates for the rows in a particular group. We can also use COUNT to see at a glance how active and how recently active a user has been:


select users.user_id, users.email, count(*), max(classified_ads.posted)
from users, classified_ads
where users.user_id = classified_ads.user_id
group by users.user_id, users.email
order by upper(users.email);

   USER_ID EMAIL				 COUNT(*) MAX(CLASSI
---------- ----------------------------------- ---------- ----------
     39406 102140.1200@compuserve.com			3 1998-10-08
     39842 102144.2651@compuserve.com			3 1998-12-13
     41426 50@seattle.va.gov				1 1997-01-13
     37428 71730.345@compuserve.com			3 1998-11-24
     35970 aaibrahim@earthlink.net			1 1998-11-08
     36671 absolutsci@aol.com				2 1998-10-06
     35781 alevy@agtnet.com				1 1997-07-14
     40111 alexzorba@aol.com				1 1998-09-25
     39060 amchiu@worldnet.att.net			1 1998-12-11
     35989 andrew.c.beckner@bankamerica.com		1 1998-08-13
     33923 andy_roo@mit.edu				1 1998-12-10

A publisher who was truly curious about this stuff probably wouldn't be interested in these results alphabetically. Let's find our most recently active users. At the same time, let's get rid of the unsightly "MAX(CLASSI" at the top of the report:


select users.user_id, 
       users.email, 
       count(*) as how_many,
       max(classified_ads.posted) as how_recent
from users, classified_ads
where users.user_id = classified_ads.user_id
group by users.user_id, users.email
order by how_recent desc, how_many desc;

   USER_ID EMAIL				 HOW_MANY HOW_RECENT
---------- ----------------------------------- ---------- ----------
     39842 102144.2651@compuserve.com			3 1998-12-13
     39968 mkravit@mindspring.com			1 1998-12-13
     36758 mccallister@mindspring.com			1 1998-12-13
     38513 franjeff@alltel.net				1 1998-12-13
     34530 nverdesoto@earthlink.net			3 1998-12-13
     34765 jrl@blast.princeton.edu			1 1998-12-13
     38497 jeetsukumaran@pd.jaring.my			1 1998-12-12
     38879 john.macpherson@btinternet.com		5 1998-12-12
     37808 eck@coastalnet.com				1 1998-12-12
     37482 dougcan@arn.net				1 1998-12-12

Note that we were able to use our correlation names of "how_recent" and "how_many" in the ORDER BY clause. The desc ("descending") directives in the ORDER BY clause instruct Oracle to put the largest values at the top. The default sort order is from smallest to largest ("ascending").

Upon close inspection, the results are confusing. We instructed Oracle to rank first by date and second by number of postings. Yet for 1998-12-13 we don't see both users with three total postings at the top. That's because Oracle dates are precise to the second even when the hour, minute, and second details are not displayed by SQL*Plus. A better query would include the clause


order by trunc(how_recent) desc, how_many desc

where the built-in Oracle function trunc truncates each date to midnight on the day in question.

Finding co-moderators: The HAVING Clause

The WHERE clause restricts which rows are returned. The HAVING clause operates analogously but on groups of rows. Suppose, for example, that we're interested in finding those users who've contributed heavily to our discussion forum:


select user_id, count(*) as how_many
from bboard
group by user_id
order by how_many desc;

   USER_ID   HOW_MANY
---------- ----------
     34474	 1922
     35164	  985
     41112	  855
     37021	  834
     34004	  823
     37397	  717
     40375	  639
...
     33963	    1
     33941	    1
     33918	    1

7348 rows selected.

Seventy three hundred rows. That's way too big considering that we are only interested in nominating a couple of people. Let's restrict to more recent activity. A posting contributed three years ago is not necessarily evidence of interest in the community right now.


select user_id, count(*) as how_many
from bboard
where posting_time + 60 > sysdate
group by user_id
order by how_many desc;

   USER_ID   HOW_MANY
---------- ----------
     34375	   80
     34004	   79
     37903	   49
     41074	   46
...

1120 rows selected.

We wanted to kill rows, not groups, so we did it with a WHERE clause. Let's get rid of the people who are already serving as maintainers.


select user_id, count(*) as how_many
from bboard
where not exists (select 1 from 
                  bboard_authorized_maintainers bam
                  where bam.user_id = bboard.user_id)
and posting_time + 60 > sysdate
group by user_id
order by how_many desc;

The concept of User ID makes sense for both rows and groups, so we can restrict our results either with the extra WHERE clause above or by letting the relational database management system produce the groups and then we'll ask that they be tossed out using a HAVING clause:


select user_id, count(*) as how_many
from bboard
where posting_time + 60 > sysdate
group by user_id
having not exists (select 1 from 
                  bboard_authorized_maintainers bam
                  where bam.user_id = bboard.user_id)
order by how_many desc;

This doesn't get to the root cause of our distressingly large query result: we don't want to see groups where how_many is less than 30. Here's the SQL for "show me users who've posted at least 30 messages in the past 60 days, ranked in descending order of volubility":


select user_id, count(*) as how_many
from bboard
where posting_time + 60 > sysdate
group by user_id
having count(*) >= 30
order by how_many desc;

   USER_ID   HOW_MANY
---------- ----------
     34375	   80
     34004	   79
     37903	   49
     41074	   46
     42485	   46
     35387	   30
     42453	   30

7 rows selected.

We had to do this in a HAVING clause because the number of rows in a group is a concept that doesn't make sense at the per-row level on which WHERE clauses operate.

Oracle 8's SQL parser is too feeble to allow you to use the how_many correlation variable in the HAVING clause. You therefore have to repeat the count(*) incantation.

Set Operations: UNION, INTERSECT, and MINUS

Oracle provides set operations that can be used to combine rows produced by two or more separate SELECT statements. UNION pools together the rows returned by two queries, removing any duplicate rows. INTERSECT combines the result sets of two queries by removing any rows that are not present in both. MINUS combines the results of two queries by taking the the first result set and subtracting from it any rows that are also found in the second. Of the three, UNION is the most useful in practice.

In the ArsDigita Community System ticket tracker, people reporting a bug or requesting a feature are given a menu of potential deadlines. For some projects, common project deadlines are stored in the ticket_deadlines table. These should appear in an HTML SELECT form element. We also, however, want some options like "today", "tomorrow", "next week", and "next month". The easiest way to handle these is to query the dual table to perform some date arithmetic. Each of these queries will return one row and if we UNION four of them together with the query from ticket_deadlines, we can have the basis for a very simple Web script to present the options:


select 
  'today - ' || to_char(trunc(sysdate),'Mon FMDDFM'),
  trunc(sysdate) as deadline
from dual 
UNION
select 
  'tomorrow - '|| to_char(trunc(sysdate+1),'Mon FMDDFM'), 
  trunc(sysdate+1) as deadline
from dual 
UNION
select
  'next week - '|| to_char(trunc(sysdate+7),'Mon FMDDFM'), 
  trunc(sysdate+7) as deadline
from dual 
UNION
select 
  'next month - '|| to_char(trunc(ADD_MONTHS(sysdate,1)),'Mon FMDDFM'), 
  trunc(ADD_MONTHS(sysdate,1)) as deadline 
from dual
UNION
select 
  name || ' - ' || to_char(deadline, 'Mon FMDDFM'), 
  deadline
from ticket_deadlines 
where project_id = :project_id
and deadline >= trunc(sysdate)
order by deadline

will produce something like

<form>

<select name="deadline_choice">

<option value="2000-10-28">today - Oct 28

</option><option value="2000-10-29">tomorrow - Oct 29

</option><option value="2000-11-04">next week - Nov 4

</option><option value="2000-11-28">next month - Nov 28

</option><option value="2000-12-01">V2.0 freeze - Dec 1

</option><option value="2000-12-15">V2.0 ship - Dec 15

</option></select>

</form>

The INTERSECT and MINUS operators are seldom used. Here are some contrived examples. Suppose that you collect contest entries by Web users, each in a separate table:


create table trip_to_paris_contest (
	user_id		references users,
	entry_date	date not null
);

create table camera_giveaway_contest (
	user_id		references users,
	entry_date	date not null
);

Now let's populate with some dummy data:


-- all three users love to go to Paris
insert into trip_to_paris_contest values (1,'2000-10-20');
insert into trip_to_paris_contest values (2,'2000-10-22');
insert into trip_to_paris_contest values (3,'2000-10-23');

-- only User #2 is a camera nerd
insert into camera_giveaway_contest values (2,'2000-05-01');

Suppose that we've got a new contest on the site. This time we're giving away a trip to Churchill, Manitoba to photograph polar bears. We assume that the most interested users will be those who've entered both the travel and the camera contests. Let's get their user IDs so that we can notify them via email (spam) about the new contest:


select user_id from trip_to_paris_contest
intersect
select user_id from camera_giveaway_contest;

   USER_ID
----------
	 2

Or suppose that we're going to organize a personal trip to Paris and want to find someone to share the cost of a room at the Crillon. We can assume that anyone who entered the Paris trip contest is interested in going. So perhaps we should start by sending them all email. On the other hand, how can one enjoy a quiet evening with the absinthe bottle if one's companion is constantly blasting away with an electronic flash? We're interested in people who entered the Paris trip contest but who did not enter the camera giveaway:


select user_id from trip_to_paris_contest
minus
select user_id from camera_giveaway_contest;

   USER_ID
----------
	 1
	 3

---

based on SQL for Web Nerds

Categories: beginner (Audience)

SQL Tutorial

Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 02:08 PM

Categories: beginner (Audience)

SQL for Web Nerds Introduction

Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 02:07 PM

The architect of any new information system must decide how much responsibility for data management the new custom software should take and how much should be left to packaged software and the operating system. This chapter explains what kind of packaged data management software is available, covering files, flat file database management systems, the RDBMS, object-relational database management systems, and object databases. This chapter also introduces the SQL language.

What Do You Need for Transaction Processing?

Data processing folks like to talk about the "ACID test" when deciding whether or not a database management system is adequate for handling transactions. An adequate system has the following properties:

Atomicity

Results of a transaction's execution are either all committed or all rolled back. All changes take effect, or none do. That means, for Joe User's money transfer, that both his savings and checking balances are adjusted or neither are. For a Web content management example, suppose that a user is editing a comment. A Web script tells the database to "copy the old comment value to an audit table and update the live table with the new text". If the hard drive fills up after the copy but before the update, the audit table insertion will be rolled back.

Consistency

The database is transformed from one valid state to another valid state. This defines a transaction as legal only if it obeys user-defined integrity constraints. Illegal transactions aren't allowed and, if an integrity constraint can't be satisfied then the transaction is rolled back. For example, suppose that you define a rule that postings in a discussion forum table must be tied to a valid user ID. Then you hire Joe Novice to write some admin pages. Joe writes a delete-user page that doesn't bother to check whether or not the deletion will result in an orphaned discussion forum posting. The DBMS will check, though, and abort any transaction that would result in you having a discussion forum posting by a deleted user.

Isolation

The results of a transaction are invisible to other transactions until the transaction is complete. For example, if you are running an accounting report at the same time that Joe is transferring money, the accounting report program will either see the balances before Joe transferred the money or after, but never the intermediate state where checking has been credited but savings not yet debited.

Durability

Once committed (completed), the results of a transaction are permanent and survive future system and media failures. If the airline reservation system computer gives you seat 22A and crashes a millisecond later, it won't have forgotten that you are sitting in 22A and also give it to someone else. Furthermore, if a programmer spills coffee into a disk drive, it will be possible to install a new disk and recover the transactions up to the coffee spill, showing that you had seat 22A.

That doesn't sound too tough to implement, does it? And, after all, one of the most refreshing things about the Web is how it encourages people without formal computer science backgrounds to program. So why not build your Internet bank on a transaction system implemented by an English major who has just discovered Perl?

Because you still need indexing.

Finding Your Data (and Fast)

One facet of a database management system is processing inserts, updates, and deletes. This all has to do with putting information into the database. Sometimes it is also nice, though, to be able to get data out. And with popular sites getting 100 hits per second, it pays to be conscious of speed.

Flat files work okay if they are very small. A Perl script can read the whole file into memory in a split second and then look through it to pull out the information requested. But suppose that your on-line bank grows to have 250,000 accounts. A user types his account number into a Web page and asks for his most recent deposits. You've got a chronological financial transactions file with 25 million entries. Crunch, crunch, crunch. Your server laboriously works through all 25 million to find the ones with an account number that matches the user's. While it is crunching, 25 other users come to the Web site and ask for the same information about their accounts.

You have two choices: (1) buy a 64-processor Sun E10000 server with 64 GB of RAM, or (2) build an index file. If you build an index file that maps account numbers to sequential transaction numbers, your server won't have to search all 25 million records anymore. However, you have to modify all of your programs that insert, update, or delete from the database to also keep the index current.

This works great until two years later when a brand new MBA arrives from Harvard. She asks your English major cum Perl hacker for "a report of all customers who have more than $5,000 in checking or live in Oklahoma and have withdrawn more than $100 from savings in the last 17 days." It turns out that you didn't anticipate this query so your indexing scheme doesn't speed things up. Your server has to grind through all the data over and over again.

Enter the Relational Database

You are building a cutting-edge Web service. You need the latest and greatest in computer technology. That's why you use, uh, Unix. Yeah. Anyway, even if your operating system was developed in 1969, you definitely can't live without the most modern database management system available. Maybe this guy E.F. Codd can help:

"Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). ... Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.
"Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed. A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced. In Section 2, certain operations on relations (other than logical inference) are discussed and applied to the problems of redundancy and consistency in the user's model."

Sounds pretty spiffy, doesn't it? Just like what you need. That's the abstract to "A Relational Model of Data for Large Shared Data Banks", a paper Codd wrote while working at IBM's San Jose research lab. It was published in the Communications of the ACM in June, 1970.

From an application programmer's point of view, the biggest innovation in the relational database is that one uses a declarative query language, SQL (an acronym for Structured Query Language and pronounced "ess-cue-el" or "sequel"). Most computer languages are procedural. The programmer tells the computer what to do, step by step, specifying a procedure. In SQL, the programmer says "I want data that meet the following criteria" and the RDBMS query planner figures out how to get it. There are two advantages to using a declarative language. The first is that the queries no longer depend on the data representation. The RDBMS is free to store data however it wants. The second is increased software reliability. It is much harder to have "a little bug" in an SQL query than in a procedural program. Generally it either describes the data that you want and works all the time or it completely fails in an obvious way.

Another benefit of declarative languages is that less sophisticated users are able to write useful programs. For example, many computing tasks that required professional programmers in the 1960s can be accomplished by non-technical people with spreadsheets. In a spreadsheet, you don't tell the computer how to work out the numbers or in what sequence. You just declare "This cell will be 1.5 times the value of that other cell over there."

How Does This RDBMS Thing Work?

This is all you need to know to be a Caveman Database Programmer: A relational database is a big spreadsheet that several people can update simultaneously.

Each table in the database is one spreadsheet. You tell the RDBMS how many columns each row has. For example, in our mailing list database, the table has two columns: name and email. Each entry in the database consists of one row in this table. An RDBMS is more restrictive than a spreadsheet in that all the data in one column must be of the same type, e.g., integer, decimal, character string, or date. Another difference between a spreadsheet and an RDBMS is that the rows in an RDBMS are not ordered. You can have a column named row_number and ask the RDBMS to return the rows ordered according to the data in this column, but the row numbering is not implicit as it would be with Visicalc or its derivatives such as Lotus 1-2-3 and Excel. If you do define a row_number column or some other unique identifier for rows in a table, it becomes possible for a row in another table to refer to that row by including the value of the unique ID.

Here's what some SQL looks like for the mailing list application

create table mailing_list (
        email           varchar(100) not null primary key,
        name            varchar(100)
);

The table will be called mailing_list and will have two columns, both variable length character strings. We've added a couple of integrity constraints on the email column. The not null will prevent any program from inserting a row where name is specified but email is not. After all, the whole point of the system is to send people e-mail so there isn't much value in having a name with no e-mail address. The primary key tells the database that this column's value can be used to uniquely identify a row. That means the system will reject an attempt to insert a row with the same e-mail address as an existing row. This sounds like a nice feature, but it can have some unexpected performance implications. For example, every time anyone tries to insert a row into this table, the RDBMS will have to look at all the other rows in the table to make sure that there isn't already one with the same e-mail address. For a really huge table, that could take minutes, but if you had also asked the RDBMS to create an index for mailing_list on email then the check becomes almost instantaneous. However, the integrity constraint still slows you down because every update to the mailing_list table will also require an update to the index and therefore you'll be doing twice as many writes to the hard disk.

That is the joy and the agony of SQL. Inserting two innocuous looking words can cost you a factor of 1000 in performance. Then inserting a sentence (to create the index) can bring you back so that it is only a factor of two or three. (Note that many RDBMS implementations, including Oracle, automatically define an index on a column that is constrained to be unique.)

Anyway, now that we've executed the Data Definition Language "create table" statement, we can move on to Data Manipulation Language: an INSERT.

insert into mailing_list (name, email)
values ('Philip Greenspun','philg@mit.edu');

Note that we specify into which columns we are inserting. That way, if someone comes along later and does

alter table mailing_list add (phone_number varchar(20));

the Oracle syntax for adding a column), our INSERT will still work. Note also that the string quoting character in SQL is a single quote. If you want to have a single quote in the string, double the single quote in "O'Grady":

insert into mailing_list (name, email)
values ('Michael O''Grady','ogrady@fastbuck.com');

Having created a table and inserted some data, at last we are ready to experience the awesome power of the SQL SELECT. Want your data back?

SQL> select * from mailing_list;


EMAIL   NAME     PHONE_NUMBER
------------------------- ------------------------- ------------
philg@mit.edu   Philip Greenspun
ogrady@fastbuck.com   Michael O'Grady

2 rows selected.

Note that there are no values in the phone_number column because we haven't set any. As soon as we do start to add phone numbers, we realize that our data model was inadequate. This is the Internet and Joe Typical User will have his pants hanging around his knees under the weight of a cell phone, beeper, and other personal communication accessories. One phone number column is clearly inadequate and even work_phone and home_phone columns won't accommodate the wealth of information users might want to give us. The clean database-y way to do this is to remove our phone_number column from the mailing_list table and define a helper table just for the phone numbers. Removing or renaming a column is easy nowadays, but for the beauty of clean code we drop the old table and create two new ones:

drop table mailing_list;

create table mailing_list (
        email           varchar(100) not null primary key,
        name            varchar(100)
);

create table phone_numbers (
        email           varchar(100) not null references mailing_list(email),
        number_type     varchar(15) check (number_type in ('work','home','cell','beeper')),
        phone_number    varchar(20) not null
);

Note that in this table the email column is not a primary key. That's because we want to allow multiple rows with the same e-mail address. If you are hanging around with a database nerd friend, you can say that there is a relationship between the rows in the phone_numbers table and the mailing_list table. In fact, you can say that it is a many-to-one relation because many rows in the phone_numbers table may correspond to only one row in the mailing_list table.

Another item worth noting about our two-table data model is that we do not store the user's name in the phone_numbers table. That would be redundant with the mailing_list table and potentially self-redundant as well, if, for example, "robert.loser@fastbuck.com" says he is "Robert Loser" when he types in his work phone and then "Rob Loser" when he puts in his beeper number, and "Bob Lsr" when he puts in his cell phone number while typing on his laptop's cramped keyboard.

Anyway, enough database nerdism. Let's populate the phone_numbers table:


SQL> insert into phone_numbers values ('ogrady@fastbuck.com','work','(800) 555-1212');

ORA-02291: integrity constraint (SCOTT.SYS_C001080) violated - parent key not found

Ooops! When we dropped the mailing_list table, we lost all the rows. The phone_numbers table has a referential integrity constraint ("references mailing_list") to make sure that we don't record e-mail addresses for people whose names we don't know. We have to first insert the two users into mailing_list:


insert into mailing_list (name, email)
values ('Philip Greenspun','philg@mit.edu');
insert into mailing_list (name, email)
values ('Michael O''Grady','ogrady@fastbuck.com');


insert into phone_numbers values ('ogrady@fastbuck.com','work','(800) 555-1212');
insert into phone_numbers values ('ogrady@fastbuck.com','home','(617) 495-6000');
insert into phone_numbers values ('philg@mit.edu','work','(617) 253-8574'); 
insert into phone_numbers values ('ogrady@fastbuck.com','beper','(617) 222-3456');

Note that the last four INSERTs use an evil SQL shortcut and don't specify the columns into which we are inserting data. The system defaults to using all the columns in the order that they were defined. Except for prototyping and playing around, we don't recommend ever using this shortcut.

The first three INSERTs work fine, but what about the last one, where Mr. O'Grady misspelled "beeper"?

ORA-02290: check constraint (SCOTT.SYS_C001079) violated

We asked Oracle at table definition time to check (number_type in ('work','home','cell','beeper')) and it did. The database cannot be left in an inconsistent state.

Let's say we want all of our data out. Email, full name, phone numbers. The most obvious query to try is a join.


SQL> select * from mailing_list, phone_numbers;

EMAIL		 NAME		  EMAIL 	   TYPE   NUMBER
---------------- ---------------- ---------------- ------ --------------
philg@mit.edu	 Philip Greenspun ogrady@fastbuck. work   (800) 555-1212
ogrady@fastbuck. Michael O'Grady  ogrady@fastbuck. work   (800) 555-1212
philg@mit.edu	 Philip Greenspun ogrady@fastbuck. home   (617) 495-6000
ogrady@fastbuck. Michael O'Grady  ogrady@fastbuck. home   (617) 495-6000
philg@mit.edu	 Philip Greenspun philg@mit.edu    work   (617) 253-8574
ogrady@fastbuck. Michael O'Grady  philg@mit.edu    work   (617) 253-8574

6 rows selected.

Yow! What happened? There are only two rows in the mailing_list table and three in the phone_numbers table. Yet here we have six rows back. This is how joins work. They give you the Cartesian product of the two tables. Each row of one table is paired with all the rows of the other table in turn. So if you join an N-row table with an M-row table, you get back a result with N*M rows. In real databases, N and M can be up in the millions so it is worth being a little more specific as to which rows you want:


select * 
from mailing_list, phone_numbers
where mailing_list.email = phone_numbers.email;

EMAIL		 NAME		  EMAIL 	   TYPE   NUMBER
---------------- ---------------- ---------------- ------ --------------
ogrady@fastbuck. Michael O'Grady  ogrady@fastbuck. work   (800) 555-1212
ogrady@fastbuck. Michael O'Grady  ogrady@fastbuck. home   (617) 495-6000
philg@mit.edu	 Philip Greenspun philg@mit.edu    work   (617) 253-8574

3 rows selected.

Probably more like what you had in mind. Refining your SQL statements in this manner can sometimes be more exciting. For example, let's say that you want to get rid of Philip Greenspun's phone numbers but aren't sure of the exact syntax.


SQL> delete from phone_numbers;

3 rows deleted.

Oops. Yes, this does actually delete all the rows in the table. You probably wish you'd typed

delete from phone_numbers where email = 'philg@mit.edu';

but it is too late now. Therefore be careful before executing "delete".

There is one more fundamental SQL statement to learn. Suppose that Philip moves to Hollywood to realize his long-standing dream of becoming a major motion picture producer. Clearly a change of name is in order, though he'd be reluctant to give up the e-mail address he's had since 1976. Here's the SQL:


SQL> update mailing_list set name = 'Phil-baby Greenspun' where email = 'philg@mit.edu';

1 row updated.

SQL> select * from mailing_list;

EMAIL		     NAME
-------------------- --------------------
philg@mit.edu	     Phil-baby Greenspun
ogrady@fastbuck.com  Michael O'Grady

2 rows selected.

As with DELETE, don't play around with UPDATE statements unless you have a WHERE clause at the end.

---

based on SQL for Web Nerds

Categories: beginner (Audience)

Solutions

Created by Anett Szabo, last modified by Anett Szabo 24 Jul 2007, at 04:14 PM

V.1 / 1

# I :: alpha -> alpha
proc I {x} { 
return $x 
}

V.1 / 2

# K :: alpha beta -> alpha
proc K {x y} {
return $x
}

V.3 / 1

# iota :: num -> [num]    
proc iota {n} {
  set I {}
  set i 0
    while {$i<$n} {
      lappend I "$i" 
    incr i
   }
return $I
}

V.3 / 2

# incrlist :: [num] -> [num]
proc incrlist {L} { 
  set k [list] 
  set i 0 
    while {$i<[llength $L]} {
      lappend k [ expr [lindex $L $i]+1 ] 
    incr i
 } 
return $k
}

# incrlist :: [num] -> [num]
proc inclist {L} { 
  set k {} 
  foreach i ${L} {
    lappend k [expr $i+1]
  } 
return $k
}

V.3/ 3

# strlenlist :: [str] -> [num]
proc strlenlist {L} { 
  set k {} 
  foreach i ${L} {
    lappend k [string length $i]
  } 
return $k
}

V.3 /4

proc sumlist {mylist} {
  set result 0
  foreach element $mylist {
      set result [expr $result + $element]
   }
return $result
}

V.3/5

proc multlist {mylist} {
  set result 1
  foreach element $mylist {
     set result [expr $result * $element]
   }
return $result
}

V.3 / 6

proc catlist {L} {
return [join $L ""]
}

Categories: beginner (Audience)

Tcl for Web Nerds Introduction

Created by Hal Abelson, Philip Greenspun, and Lydia Sandon, last modified by Anett Szabo 24 Jul 2007, at 04:05 PM

Evaluation and quoting

Each line of Tcl is interpreted as a separate command:

procedure_name arg1 arg2 arg3

Arguments are evaluated in sequence. The resulting values are then passed to procedure_name, which is assumed to be a system- or user-defined procedure. Tcl assumes that you're mostly dealing with strings and therefore stands some of the conventions of standard programming languages on their heads. For example, you might think that set foo bar would result in Tcl complaining about an undefined variable (bar). But actually what happens is that Tcl sets the variable foo to the character string "bar":

> tclsh
% set foo bar
bar
% set foo 
bar
%

The first command line illustrates that the set command returns the new value that was set (in this case, the character string "bar"), which is the result printed by the interpreter. The second command line uses the set command again, but this time with only one argument. This actually gets the value of the variable foo. Notice that you don't have to declare variables before using them.

By analogy with Scheme, you'd think that we could just type the command line $foo and have Tcl return and print the value. This won't work in Tcl, however, which assumes that every command line invokes a procedure. This is why we need to explicity use set or puts.

Does this mean that you never need to use string quotes when you've got string literals in your program? No. In Tcl, the double quote is a grouping mechanism. If your string literal contains any spaces, which would otherwise be interpreted as argument separators, you need to group the tokens with double quotes:

% set the_truth Lisp is the world's best computer language
wrong # args: should be "set varName ?newValue?"
% set the_truth "Lisp is the world's best computer language"
Lisp is the world's best computer language

In the first command above, the Tcl interpreter saw that we were attempting to call set with seven arguments. In the second command, we grouped all the words in our string literal with the double quotes and therefore the Tcl interpreter saw only two arguments to set. Note a stylistic point here: multi-word variable names are all-lowercase with underscores separating the words. This makes our Tcl code very compatible with relational database management systems where underscore is a legal character in a column name.

In this example, we invoked the Unix command "tclsh" to start the Tcl interpreter from the Unix shell. Later on we'll see how to use Tcl in other ways:

writing a program file and evaluating it
embedding Tcl commands in Web pages to create dynamic pages
extending the behavior of the Web server with Tcl programs

For now, let's stick with typing interactively at the shell. You can keep evaluating Tcl commands at the % prompt until you exit the Tcl shell by evaluating exit.

To indicate a literal string that contains a space, you can wrap the string in double quotes. Quoting like this does not prevent the interpreter from evaluating procedure calls and variables inside the strings:

% set checking_account_balance [expr {25 + 34 + 86}]
145
% puts "your bank balance is $checking_account_balance dollars"
your bank balance is 145 dollars
% puts "ten times your balance is [expr {10 * $checking_account_balance}] dollars"
ten times your balance is 1450 dollars

The interpreter looks for dollar signs and square brackets within quoted strings. This is known as variable interpolation What if you need to include a dollar sign or a square bracket? One approach is to escape with backslash:

% puts "your bank balance is \$$checking_account_balance"
your bank balance is $145
% puts "your bank balance is \$$checking_account_balance \[pretty sad\]"
your bank balance is $145 [pretty sad]

If we don't need Tcl to evaluate variables and procedure calls inside a string, we can use braces for grouping rather than double quotes:

% puts {your bank balance is $checking_account_balance dollars}
your bank balance is $checking_account_balance dollars
% puts {ten times your balance is [expr {10 * $checking_account_balance}] dollars}
ten times your balance is [expr {10 * $checking_account_balance}] dollars

Throughout the rest of this book you'll see hundreds of examples of braces being used as a grouping character for Tcl code. For example, when defining a procedure or using control structure commands, conditional code is grouped using braces.

Keep it all on one line!

The good news is that Tcl does not suffer from cancer of the semicolon. The bad news is that any Tcl procedure call or command must be on one line from the interpreter's point of view. Suppose that you want to split up

% set a_very_long_variable_name "a very long value of some sort..."

If you want to have newlines within the double quotes, that's just fine:

% set a_very_long_variable_name "a very long value of some sort... 
with a few embedded newlines
makes for rather bad poetry"
a very long value of some sort... 
with a few embedded newlines
makes for rather bad poetry
%

It also works to do it with braces

% set a_very_long_variable_name {a very long value of some sort... 
with a few embedded newlines
makes for rather bad poetry}
a very long value of some sort... 
with a few embedded newlines
makes for rather bad poetry
%

but if you were to try

set a_very_long_variable_name 
    "a very long value of some sort... 
     with a few embedded newlines
     makes for rather bad poetry"

Tcl would interpret this as two separate commands, the first a call to set to find the existing value of a_very_long_variable_name and the second a call to the procedure named "a very long value...":

can't read "a_very_long_variable_name": no such variable
invalid command name "a very long value of some sort... 
     with a few embedded newlines
     makes for rather bad poetry"

If you want to continue a Tcl command on a second line, it is possible to use the backslash to escape the newline that would otherwise terminate the command:

% set a_very_long_variable_name \
      "a very long value of some sort... 
       with a few embedded newlines
       makes for rather bad poetry"
a very long value of some sort... 
       with a few embedded newlines
       makes for rather bad poetry
%

Note that this looks good as code but the end-result is probably not what you'd want. The second and third lines of our poem contain seven spaces at the beginning of each line. You probably want to do something like this:

% set a_very_long_variable_name "a very long value of some sort... 
with a few embedded newlines
makes for rather bad poetry"

Case Sensitivity, Poisonous Unix Heritage, and Naming Conventions

The great case-sensitivity winter descended upon humankind in 1970 with the Unix operating system.

% set MyAge 36
36
% set YearsToExpectedDeath [expr {80-$Myage}]
can't read "Myage": no such variable
%

Variables and procedure names in Tcl are case-sensitive. We consider it very bad programming style to depend on this, though. For example, you shouldn't simultaneously use the variables Username and username and rely on the computer to keep them separate; the computer will succeed but humans maintaining the program in the future will fail. So use lowercase all the time with underscores to separate words!

Procedures

One of the keys to making a large software system reliable and maintainable is procedural abstraction. The idea is to take a complex operation and encapsulate it into a function that other programmers can call without worrying about how it works.

To define a procedures in Tcl use the following syntax:

proc name { list_of_arguments } {
   body_expressions
}

This creates a procedure with the name "name." Tcl has a global environment for procedure names, i.e., there can be only one procedure called "foobar" in a Tcl system.

The next part of the syntax is the set of arguments, delimited by a set of curly braces. Each argument value is then mapped into the procedure body, which is also delimited by curly braces. As before, each statement of the procedure body can be separated by a semi-colon or a newline (or both). Here's an example, taken from the calendar widget component of the ArsDigita Community System:

proc calendar_convert_julian_to_ansi { date } {
    set db [ns_db gethandle subquery]
    # make Oracle do all the real work
    set output [database_to_tcl_string $db \ 
        "select trunc(to_date('$date', 'J')) from dual"]
    ns_db releasehandle $db
    return $output
}

As you can see the variable "date" is set in the context of the procedure so you can address the value with "$date".

Here's the factorial procedure in Tcl:

% #this is good old recursive factorial
% proc factorial {number} {
    if { $number == 0 } {
	return 1
    } else {                 
	return [expr {$number * [factorial [expr {$number - 1}]]}]
    }
}
% factorial 10
3628800

At first glance, you might think that you've had to learn some new syntax here. In fact, the Tcl procedure-creation procedure is called like any other. The three arguments to proc are procedure_name arglist body. The creation command is able to extend over several lines not because the interpreter recognizes something special about proc but because we've used braces to group blocks of code. Similarly the if statement within the procedure uses braces to group its arguments so that they are all on one line as far as the interpreter is concerned.

As the example illustrates, we can use the standard base-case-plus-recursion programming style in Tcl. Our factorial procedure checks to see if the number is 0 (the base case). If so, it returns 1. Otherwise, it computes factorial of the number minus 1 and returns the result multiplied by the number. The # character signals a comment.

Examine the following example:

% set checking_account_balance [expr {25 + 34 + 86}]
145
% puts "\nYour checking balance is \$$checking_account_balance.
If you're so smart, why aren't you rich like Bill Gates?
He probably has \$[factorial $checking_account_balance] by now."


Your checking balance is $145.
If you're so smart, why aren't you rich like Bill Gates?
He probably has $0 by now.

There are a few things to observe here:

The "\n" at the beginning of the quoted string argument to puts resulted in an extra newline in front of the output.
The newline after balance. did not terminate the puts command. The string quotes group all three lines together into a single argument.
The [factorial... ] procedure call was evaluated but resulted in an output of 0. This isn't a bug in the evaluation of quoted strings, but rather a limitation of the Tcl language itself:
```
% factorial 145
0
```

Exercises

1. Write the identity function (the I combinator), which simply returns its argument (any type of argument) unchanged:

# I :: alpha -> alpha
proc I {x} {...}

Examples:

I 12
=> 12
I foo
=> foo
I {string length abracadabra}
=> {string length abracadabra}

Why is this a useful function?

Answer

2. Write the K combinator, which takes two arguments and always returns its first argument, unchanged:

# K :: alpha beta -> alpha
proc K {x y} {...}

Examples:

K 0 456
=> 0
K foo 1
=> foo

Why is this a useful function?

Answer

---

based on Tcl for Web Nerds

Categories: beginner (Audience)

List Operations

Created by Anett Szabo, last modified by Anett Szabo 24 Jul 2007, at 01:19 PM

A Tcl list holds a sequence of elements, each of which can be a number, a string, or another list. Let's look at the commands for constructing a list:

% # create an empty list using the list command  
% set user_preferences [list]
% # verify that we've created a 0-item list
% llength $user_preferences
0
% lappend user_preferences "hiking"
hiking
% lappend user_preferences "biking"
hiking biking
% lappend user_preferences "whale watching"
hiking biking {whale watching}
% llength $user_preferences
3

At this point, the variable user_preferences is a three-element list. We can pull individual items out with lindex:

% lindex $user_preferences 0
hiking
% lindex $user_preferences 1
biking
% lindex $user_preferences 2
whale watching
% lindex $user_preferences 3
% lindex $user_preferences 5

Note, that lindex list 0 gives the first element of the list! (Indexing is 0-based and lindex will return the empty string rather than an error if you supply an out-of-range index.)

When producing a page for a user, we'd be more likely to be interested in searching the list. The command lsearch returns the index of the list element matching a query argument or -1 if unsuccessful:

if { [lsearch -exact $user_preferences "hiking"] != -1 } {
    # look for new articles related to hiking 
}

Concat

Suppose that User A marries User B. You want to combine their preferences into a household_preferences variable using the concat command:

% # use the multiple-argument form of list to create an N-element
% # list with one procedure call
% set spouse_preferences [list "programming" "computer games" "slashdot"]
programming {computer games} slashdot
% set household_preferences [concat $user_preferences $spouse_preferences]
hiking biking {whale watching} programming {computer games} slashdot
% llength $household_preferences
6

Split and Join

Suppose we have a file called addressees.txt with information about people, one person to a line. Suppose each of these lines contains, among other information, an email address which we assume we can recognize by the presence of an at-sign (@). The following program extracts all the email addresses and joins them together, separated by commas and spaces, to form a string called spam_address that we can use as the Bcc: field of an email message, to spam them all:

# open the file for reading
set addressees_stream [open "~/addressees.txt" r]

# read entire file into a variable
set contents_of_file [read $addressees_stream] 

close $addressees_stream

# split the contents on newlines
set list_of_lines [split $contents_of_file "\n"]

# loop through the lines 
foreach line $list_of_lines {
    if { [regexp {([^ ]*@[^ ]*)} $line one_address] } {
        lappend all_addresses $one_address
    }
}

# use the join command to mush the list together
set bcc_line_for_mailer [join $all_addresses ", "]

Some things to observe here:

We've used the foreach operator (see the chapter on control structure) to iterate over the list formed by splitting the file at newline characters.
We use pattern matching to extract the email address from each line. The pattern here specifies "stuff that doesn't contain a space, followed by at-sign, followed by stuff that doesn't contain a space." (See the explanation of regexp in the chapter on pattern matching.)
The iteration keeps lappending to all_addresses, but this variable is never initialized. lappend treats an unbound list variable the same as an empty list.

Reference: List operations

list arg1 arg2 ...
Construct and return a list the arguments. Akin to (list arg1 arg2...) in Scheme.
```
set foo [list 1 2 [list 3 4 5]] ==> 1 2 {3 4 5}
```
or
```
set foo {1 2 {3 4 5}}==> 1 2 {3 4 5}
```
lset varName ?index...? newValueGives you the opportunity to insert elements at a given position and work with positions within a list in general..

lindex list i
Returns the ith element from list; starts at index 0.

llength $foo ==> 1
llength list
Returns the number of elements in list.
```
llength $foo ==> 3
```
lrange list i j
Returns the ith through jth elements from list.
```
lrange $foo 1 2 ==> 2 {3 4 5}
```
lappend listVar arg arg...
Append elements to the value of listVar and reset listVar to the new list. Please note that listVar is the name of a variable and not the value, i.e., you should not put a $ in front of it (the same way that set works.
```
lappend foo [list 6 7] ==> 1 2 {3 4 5} {6 7}
set foo ==> 1 2 {3 4 5} {6 7}
```
linsert list index arg arg...
Insert elements into list before the element at position index. Returns a new list.
```
linsert $foo 0 0  ==> 0 1 2 {3 4 5} {6 7}
```
lreplace list i j arg arg...
Replace elements i through j of list with the args. Returns a new list and leaves the original list unmodified.
```
lreplace $foo 3 4 3 4 5 6 7 ==> 0 1 2 3 4 5 6 7
set foo ==> 1 2 {3 4 5} {6 7}
```
lsearch mode list value
Return the index of the element in list that matches the value according to the mode, which is -exact, -glob, or -regexp. -glob is the default. Return -1 if not found.
```
set community_colleges [list "caltech" "cmu" "rpi"]
lsearch -exact $community_colleges "caltech" ==> 0
lsearch -exact $community_colleges "harvard" ==> -1
```
lsort switches list
Sort elements of the list according to the switches: -ascii, -integer, -real, -increasing, -decreasing, -command command. Returns a new list.
```
set my_friends [list "herschel" "schlomo" "mendel"]
set my_sorted_friends [lsort -decreasing $my_friends] ==> schlomo mendel herschel 
```

concat arg arg...
Join multiple lists together into one list.

set my_wifes_friends [list "biff" "christine" "clarissa"]
concat $my_wifes_friends $my_friends ==> biff christine clarissa herschel schlomo mendel

join list joinString
Merge the elements of list to produce a string, where the original list elements are separated by joinString. Returns the resulting string. Note: if you want to merge without separating the elements, just put "" .
```
set foo_string [join $foo ":"] ==> 0:1:2:3:4:5:6:7
```
split string splitChars
Split a string to produce a list, using (and discarding) the characters in splitChars as the places where to split. Returns the resulting list.
```
set my_ip_address 18.1.2.3 ==> 18.1.2.3
set ip_elements [split $my_ip_address "."] ==> four element list,
with values 18 1 2 3
```

Exercises

1. Write the iota function, which takes a numeric argument n and returns a list of numbers of length n which are the numbers from 0 to n-1.

# iota :: num -> [num]
proc iota {n} {...}

Examples:

iota 3
=> 0 1 2
iota 20
=> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Answer

2. Write the function incrlist, which takes one argument, a list of numbers, and returns a list of equal length as a result, in which each element is the successor to the corresponding element of the argument list.

# incrlist :: [num] -> [num]
proc incrlist {L} {...}
incrlist {34 987 1 567 -23 8}
=> 35 988 2 568 -22 9
incrlist [iota 12]
=> 1 2 3 4 5 6 7 8 9 10 11 12

Hint: Arithmetic is not done directly by the Tcl interpreter. It is done by calling the C library using the expr command on arthmetic expressions.

Does your function work for the empty list?

Answer

3. Write the function strlenlist, which takes one argument, a list of strings, and returns a list of equal length as a result, in which each element is the string length of the corresponding element of the argument list.

# strlenlist :: [str] -> [num]
proc strlenlist {L} {...}
strlenlist {34 987 1 567 -23 8}
=> 2 3 1 3 3 1
strlenlist {foo bar antidisestablishmentarianism}
=> 3 3 28

Does your function work for the empty list?

Answer

4. Write sumlist, which takes one argument, a list of numbers, and returns, as result, a single number: the sum of the numbers in the argument list.

# sumlist :: [num] -> num
proc sumlist {L} {...}
sumlist [iota 3]
=> 3
sumlist {34 987 1 567 -23 8}
=> 1574

Answer

5. Write multlist, which takes one argument, a list of numbers, and returns, as result, a single number: the product of the numbers in the argument list.

# multlist :: [num] -> num
proc multlist {L} {...}
multlist [iota 3]
=> 0
multlist [incrlist [iota 3]]
=> 3
multlist {34 987 1 567 -23 8}
=> 793928272

Answer

6. Write catlist, which takes one argument, a list of strings, and returns, as result, a single string: the concatenation of the strings in the argument list.

# catlist :: [str] -> str
proc catlist {L} {...}
catlist [iota 3]
=> 012
catlist [incrlist [iota 3]]
=> 123
list {foo bar antidisestablishmentarianism}
=> foobarantidisestablishmentarianism

Answer

Data abstraction with lists

The Tcl shell's output is giving you an ugly insight into the internal representation of lists as strings, with elements being separated by spaces and grouped with braces. There is no reason to rely on how Tcl represents lists or even think about it. Practice data abstraction by using Tcl lists as your underlying storage mechanism but define constructor and accessor procedures that the rest of your source code invokes. You won't find a huge amount of this being done in Web development because the really important data structures tend to be RDBMS tables, but here's an example of how it might work, taken from http://photo.net/philg/careers/four-random-people.tcl. We're building a list of lists. Each sublist contains all the information on a single historical figure. Method A is quick and dirty:

set einstein [list "A. Einstein" "Patent Office Clerk" "Formulated Theory of Relativity."]

set mill [list "John Stuart Mill" "English Youth" "Was able to read Greek and Latin at age 3."]

# let's build the big list of lists
set average_folks [list $einstein $mill ...]

# let's pull out Einstein's title 
set einsteins_title [lindex $einstein 1]

Method B uses data abstraction:

proc define_person {name title accomplishment} {
    return [list $name $title $accomplishment]
}

proc person_name {person} {
    return [lindex $person 0]
}

proc person_title {person} {
    return [lindex $person 1]
}

proc person_accomplishment {person} {
    return [lindex $person 2]
}

% set einstein [define_person "A. Einstein" "Patent Office Clerk" "Formulated Theory of Relativity."]
{A. Einstein} {Patent Office Clerk} {Formulated Theory of Relativity.}
% set einsteins_title [person_title $einstein]
Patent Office Clerk

Data abstraction will make building and maintaining a large system much easier. As noted above, however, the stateless nature of HTTP means that any information you want kept around from page to page must be kept by the RDBMS. SQL already forces you to refer to data by table name and column name rather than positionally.

---

based on Tcl for Web Nerds

Categories: beginner (Audience)

Pattern matching

Created by Anett Szabo, last modified by Anett Szabo 24 Jul 2007, at 11:56 AM

Pattern matching is important across a wide variety of Web programming tasks but most notably when looking for exceptions in user-entered data and when trying to parse information out of non-cooperating Web sites.

Tcl's pattern matching facilities test whether a given string matches a specified pattern. Patterns are described using a syntax known as regular expressions. For example, the pattern expression consisting of a single period matches any character. The pattern a..a matches any four-character string whose first and last characters are both a.

The regexp command takes a pattern, a string, and an optional match variable. It tests whether the string matches the pattern, returns 1 if there is a match and zero otherwise, and sets the match variable to the part of the string that matched the pattern:

% set something candelabra
candelabra

% regexp a..a $something match
1

% set match
abra

Patterns can also contain subpatterns (delimited by parentheses) and denote repetition. A star denotes zero or more occurrences of a pattern, so a(.*)a matches any string of at least two characters that begins and ends with the character a. Whatever has matched the subpattern between the a's will get put into the first subvariable:

% set something candelabra
candelabra

% regexp a(.*)a $something match
1

% set match
andelabra

Note that Tcl regexp by default behaves in a greedy fashion. There are three alternative substrings of "candelabra" that match the regexp a(.*)a: "andelabra", "andela", and "abra". Tcl chose the longest substring. This is very painful when trying to pull HTML pages apart:

% set simple_case "Normal folks might say <i>et cetera</i>"
Normal folks might say <i>et cetera</i>
% regexp {<i>(.+)</i>} $simple_case match italicized_phrase
1

% set italicized_phrase
et cetera

% set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"
Pedants say <i>sui generis</i> and <i>ipso facto</i>
% regexp {<i>(.+)</i>} $some_html match italicized_phrase
1

% set italicized_phrase
sui generis</i> and <i>ipso facto

What you want is a non-greedy regexp, a standard feature of Perl and an option in Tcl 8.1 and later versions (see http://www.scriptics.com/services/support/howto/regexp81.html).

Lisp systems in the 1970s included elegant ways of returning all possibilities when there were multiple matches for an expression. Java libraries, Perl, and Tcl demonstrate the progress of the field of computer science by ignoring these superior systems of decades past.

Matching Cookies From the Browser

A common problem in Web development is pulling information out of cookies that come from the client. The cookie spec at http://home.netscape.com/newsref/std/cookie_spec.html mandates that multiple cookies be separated by semicolons. So you look for "the cookie name that you've been using" followed by an equals sign and them slurp up anything that follows that isn't a semicolon. Here is how the ArsDigita Community System looks for the value of the last_visit cookie:

regexp {last_visit=([^;]+)} $cookie match last_visit

Note the square brackets inside the regexp. The Tcl interpreter isn't trying to call a procedure because the entire regexp has been grouped with braces rather than double quotes. Square brackets denote a range of acceptable characters:

[A-Z] would match any uppercase character
[ABC] would match any of first three characters in the alphabet (uppercase only)
[^ABC] would match any character other than the first three uppercase characters in the alphabet, i.e., the ^ reverses the sense of the brackets

The plus sign after the [^;] says "one or more characters that meets the preceding spec", i.e., "one or more characters that isn't a semicolon". It is distinguished from * in that there must be at least one character for a match.

If successful, the regexp command above will set the match variable with the complete matching string, starting from "last_visit=". Our code doesn't make any use of this variable but only looks at the subvar last_visit that would also have been set.

Pages that use this cookie expect an integer and this code failed in one case where a user edited his cookies file and corrupted it so that his browser was sending several thousands bytes of garbage after the "last_visit=". A better approach might have been to limit the match to digits:

regexp {last_visit=([0-9]+)} $cookie match last_visit

Matching Into Multiple Variables

More generally regexp allows multiple pattern variables. The pattern variables after the first are set to the substrings that matched the subpatterns. Here is an example of matching a credit card expiration date entered by a user:

% set date_typed_by_user "06/02"
06/02

% regexp {([0-9][0-9])/([0-9][0-9])} $date_typed_by_user match month year
1

% set month
06

% set year
02
%

Each pair of parentheses corresponds to a subpattern variable.

Full Syntax

The most general form of regexp includes optional flags as well as multiple match variables:

regexp [flags] pattern data matched_result var1 var2 ...

The various flags are

-nocase
uppercase characters in the data are bashed down to lower for case-insensitive matching (make sure that your pattern is all lowercase!)
-indices
the returned values of the regexp contain the indices delimiting the matched substring, rather than the strings themselves.
If your pattern begins with a -, put a -- flag at the end of your flags

Regular expression syntax is:

.
matches any character.
*
matches zero or more instances of the previous pattern item.
+
matches one or more instances of the previous pattern item.
?
matches zero or one instances of the previous pattern item.
|
disjunction, e.g., (a|b) matches an a or a b
( )
groups a sub-pattern.
[ ]
delimits a set of characters. ASCII Ranges are specified using hyphens, e.g., [A-z] matches any character from uppercase A through lowercase z (i.e., any alphabetic character). If the first character in the set is ^, this complements the set, e.g., [^A-z] matches any non-alphabetic character.
^
Matches only when the pattern appears at the beginning of the string. The ^ must appear at the beginning of the pattern expression.
$
Matches only when the pattern appears at the end of the string. The $ must appear last in the pattern expression.

More: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm

Matching with substitution

It's common in Web programming to create strings by substitution. Tcl's regsub command performs substitution based on a pattern:

regsub [flags] pattern data replacements var

matches the pattern against the data. If the match succeeds, the variable named var is set to data, with various parts modified, as specified by replacements. If the match fails, var is simply set to data. The value returned by regsub is the number of replacements performed.

The flag -all specifies that every occurrence of the pattern should be replaced. Otherwise only the first occurrence is replaced. Other flags include -nocase and -- as with regexp

Here's an example from the banner ideas module of the ArsDigita Community System (see http://photo.net/doc/bannerideas.html). The goal is that each banner idea contain a linked thumbnail image. To facilitate cutting and pasting of the image html, we don't require that the publisher include uniform subtags within the IMG. However, we use regexp to clean up:

# turn "<img align=right hspace=5" into "<img align=left border=0 hspace=8"
regsub -nocase {align=[^ ]+} $picture_html "" without_align
regsub -nocase {hspace=[^ ]+} $without_align "" without_hspace
regsub -nocase {<img} $without_hspace {<img align=left border=0 hspace=8} final_photo_html

In the example above, <replacements> specified the literal characters ''. Other replacement directives include:

& inserts the string that matched the pattern
The backslashed numbers \1 through \9 inserts the strings that matched the corresponding sub-patterns in the pattern.

Here's another web example, which parses HTML, and replaces the comments (delineated in HTML by ) by the comment text, enclosed in parentheses.

% proc extract_comment_text {html} {
    regsub -all {<!--([^-]*)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

% extract_comment_text {<!--insert the price below-->
We give the same low price to everyone: $219.99
<!--make sure to query out discount if this is one of our big customers-->}
(insert the price below)
We give the same low price to everyone: $219.99
(make sure to query out discount if this is one of our big customers)

More: http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm

String match

Tcl provides an alternative matching mechanism that is simpler for users to understand than regular expressions. The Tcl command string match uses "GLOB-style" matching. Here is the syntax:

string match pattern data

It returns 1 if there is a match and 0 otherwise. The only pattern elements permitted here are ?, which matches any single character; *, which matches any sequence; and [], which delimits a set of characters or a range. This differs from regexp in that the pattern must match the entire string supplied:

% regexp "foo" "foobar"
1

% string match "foo" "foobar"
0

% # here's what we need to do to make the string match 
% # work like the regexp
% string match "*foo*" foobar
1

Here's an example of the character range system in use:

string match {*[0-9]*} $text

returns 1 if text contains at least one digit and 0 otherwise.

More: http://www.tcl.tk/man/tcl8.4/TclCmd/string.htm

Exercises

Write a procedure which takes a string and makes sure that the result contains an "@" sign
Extend the procedure to make sure that only letters, numbers are allowed before the "@" sign
Extend the procedure to check that after the @ sign comes a valid domain (hint, look at 2.) A valid domain contains of at least one "." and only letters after the last ".". so malte.cognovis.de is a valid domain, cognovis.d1e is not.
Extend the procedure to return "Welcome foo, member of bar.com" if the string is "foo@bar.com"
Extend the procedure to return "Welcome OpenACS member foo" if the string is like "foo@openacs.org" meaning, the e-mail ends with openacs.org
Check against the valid domain again. This time make use of the ad_locales table installed in your local copy of OpenACS. To make this work you will have to use the OpenACS Shell.

Get a list of all countries from the table ad_locales. Choose the language column for this. The command to extract this is "db_list".
If your list contains the language "ca" more than once, make sure to limit it to one "ca" only. Make sure this works for others as well.
As ".com" ".org" and ".net" are also valid domain ending append them to the list.
Make sure that the domain ends on any language defined in your list you created. So automotive.ca works but automotive.eu does not (and yes, I know that .eu is now a valid domain :-)).

Answer

Search at amazon.com for your favorite book. Copy the URL until you see the "/ref..." part, e.g. http://www.amazon.com/4-Hour-Workweek-Escape-Live-Anywhere/dp/0307353133
In the OpenACS shell use "ad_httpget" to retrieve the URL you copied. Look at the api doc for the syntax.
Use regexp to find the price of the book in the html source returned to you by ad_httpget
Return the price of the book.

Answer

---

based on Tcl for Web Nerds

Categories: beginner (Audience)

Numbers

Created by Anett Szabo, last modified by Anett Szabo 11 Jul 2007, at 04:47 PM

Arithmetic is not done directly by the Tcl interpreter. It is done by calling the C library using the expr command on arthmetic expressions. The detailed parsing rules for arithmetic expressions depend on the particular Unix implementation, but they are more or less like in C.

Here are some examples:

# integer division truncates
% expr {7 / 2}
3

# the percent sign is used to compute integer remainder
% expr {7 % 2} 
1

# floating point propagates
% expr {7.0 / 2}
3.5

% expr {sin(.5) + cos(.9)}
1.10103550687

% # a zero in front of number means to interpret as octal
% expr {017 + 01}
16

% # a 0x in front means to interpret as hex
% expr {0xA + 1}
11

% # numbers can be treated like strings!
% string length 100.34
6
% string range 100.34 0 2
100

More: http://www.tcl.tk/man/tcl8.4/TclCmd/expr.htm

Reference

Here are the numeric functions included in Tcl. (Details may vary depending on your Unix implementation of expr.)

abs(x)
asin(x)
acos(x)
atan(x)
atan2(y,x)
atan2 returns the angle theta of the polar coordinates returned when (x,y) is converted to (r, theta).
ceil(x)
cos(x)
cosh(x)
double(x)
returns x as a double or floating point.
exp(x)
returns e^x
floor(x)
fmod(x,y)
returns the floating point remainder of x/y.
hypot(x,y)
returns the square root of the sum of x squared plus y squared, the length of the line from (0,0) to (x,y).
int(x)
truncates x to an integer.
log(x)
returns the natural log of x.
log10(x)
returns log base 10 of x.
pow(x,y)
returns x to the y power.
rand()
round(x)
sin(x)
sinh(x)
sqrt(x)
tan(x)
tanh(x)
wide(x)

---

based on Tcl for Web Nerds

Categories: beginner (Audience)

Exec

Created by Anett Szabo, last modified by Anett Szabo 11 Jul 2007, at 04:29 PM

You can invoke other Unix programs from within a Tcl script. For a Web developer this is either a tremendous convenience or a dangerous security hole.

Let's cover the convenience part first. If we want to have a Web page that displays how long the server has been up and the current load average, the exec command is helpful:

% exec /usr/bin/uptime
  4:04pm  up 29 days,  4:32,  3 users,  load average: 0.61, 0.66, 0.63

To make an AOLserver Tcl page that returns this output, you need only wrap the exec in an API call:

ns_return 200 text/plain [exec /usr/bin/uptime]

The photo sharing system at http://photo.net/photodb/ stores user-uploaded content in the Unix file system and hence makes extensive use of Unix commands to create directories, build thumnails of uploaded images, etc. Here are some examples of how the photodb system uses exec:

# see how much disk space a user is consuming
set disk_usage [exec du -k $data_path]

# find out how large an uploaded image is (X-Y pixel size)
# by invoking the ImageMagick command "identify"
set identify_str [exec /usr/local/bin/identify -verbose $filename]
regexp {geometry: ([0-9]*)x([0-9]*)} $identify_str match image_x image_y

# create a thumbnail-sized image using the ImageMagick command "convert"
set result_status [exec /usr/local/bin/convert $filename -geometry $size_sm -quality 75 $filename_sm]

# remove a directory of user-uploaded images
if [catch { exec rm -f $path } errmsg] ...

The Dangerous Part

Scripting languages like Perl or Tcl are convenient for Web development but it is possible to write a script that takes user-supplied input and evaluates it. With Tcl, you run the risk that a user will upload a string containing "[exec /usr/openwin/bin/xterm -display 18.30.0.1]". Because of the [] characters, if this string is ever fed to eval or subst a cracker would have a shell on your Web server.

If you don't need to use exec an easy solution to the problem is to redefine it:

% proc exec args { return "happy happy joy joy" }
% exec cat /etc/passwd
happy happy joy joy

If you do need to use exec, at least make sure that your Web server is running as an unprivileged user with limited authority to execute Unix programs. Depending on your publishing requirements and choice of Web server, it may be possible to run the Web server in a chroot() environment (this is very easy with AOLserver 3.0). This changes the root directory as far as the Web server is concerned. Thus a Tcl program running within the Web server will not be able to even look at files or programs elsewhere on the computer.

If you decide to run chrooted, you will have to copy any programs that you actually do need to exec so that they are underneath the Web server's root directory.

More: http://www.tcl.tk/man/tcl8.4/TclCmd/exec.htm

This is the last of the main Tcl topics. For more information, check out aolserver.com and the books referenced in this book.

---

based on Tcl for Web Nerds

Categories: beginner (Audience)