German order reigns inside the system itself: messages are uniquely keyed with
create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
email varchar(200),
name varchar(200),
one_line varchar(700),
message clob,
notify char(1) default 'f' check (notify in ('t','f')),
posting_time date,
sort_key varchar(600)
);
msg_id, refer to each other (i.e., say "I'm a
response to msg X") with refers_to, and a thread can be
displayed conveniently by using the sort_key column.
Italian chaos is permitted in the email and
name columns; users could remain anonymous, masquerade as
"president@whitehouse.gov" or give any name.
This seemed like a good idea when I built the system. I was concerned that it work reliably. I didn't care whether or not users put in bogus content; the admin pages made it really easy to remove such postings and, in any case, if someone had something interesting to say but needed to remain anonymous, why should the system reject their posting?
One hundred thousand postings later, as the moderator of the photo.net Q&A forum, I began to see the dimensions of my data modeling mistakes.
First, anonymous postings and fake email addresses didn't come from Microsoft employees revealing the dark truth about their evil bosses. They came from complete losers trying and failing to be funny or wishing to humiliate other readers. Some fake addresses came from people scared by the rising tide of spam email (not a serious problem back in 1995).
Second, I didn't realize how the combination of my email alert systems, fake email addresses, and Unix mailers would result in my personal mailbox filling up with messages that couldn't be delivered to "asdf@asdf.com" or "duh@duh.net".
Although the solution involved changing some Web scripts, fundamentally the fix was add a column to store the IP address from which a post was made:
Keeping these data enabled me to see that most of the anonymous posters were people who'd been using the forum for some time, typically from the same IP address. I just sent them mail and asked them to stop, explaining the problem with bounced email.
alter table bboard add (originating_ip varchar(16));
After four years of operating the photo.net community, it became apparent that we needed ways to
Note that a contributor's name and email address no longer appear in the
create table users (
user_id integer not null primary key,
first_names varchar(100) not null,
last_name varchar(100) not null,
email varchar(100) not null unique,
..
);
create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
topic varchar(100) not null references bboard_topics,
category varchar(200), -- only used for categorized Q&A forums
originating_ip varchar(16), -- stored as string, separated by periods
user_id integer not null references users,
one_line varchar(700),
message clob,
-- html_p - is the message in html or not
html_p char(1) default 'f' check (html_p in ('t','f')),
...
);
create table classified_ads (
classified_ad_id integer not null primary key,
user_id integer not null references users,
...
);
bboard table. That doesn't mean we don't know who posted a
message. In fact, this data model can't even represent an anonymous
posting: user_id integer not null references users
requires that each posting be associated with a user ID and that there
actually be a row in the users table with that ID.
First, let's talk about how much fun it is to move a live-on-the-Web
600,000 hit/day service from one data model to another. In this case,
note that the original bboard data model had a single
name column. The community system has separate columns for
first and last names. A conversion script can easily split up "Joe
Smith" but what is it to do with William Henry Gates
III?
How do we copy over anonymous postings? Remember that Oracle is not
flexible or intelligent. We said that we wanted every row in the
bboard table to reference a row in the users
table. Oracle will abort any transaction that would result in a
violation of this integrity constraint. So we either have to drop all
those anonymous postings (and any non-anonymous postings that refer to
them) or we have to create a user called "Anonymous" and assign all the
anonymous postings to that person. The technical term for this kind of
solution is kludge.
A more difficult problem than anonymous postings is presented by long-time users who have difficulty typing and or keeping a job. Consider a user who has identified himself as
Once we've copied everything into this new normalized data model, notice that we can't dig ourselves into the same hole again. If a user has contributed 1000 postings, we don't have 1000 different records of that person's name and email address. If a user changes jobs, we need only update one column in one row in one table.
The html_p column in the new data model is worth
mentioning. In 1995, I didn't understand the problems of user-submitted
data. Some users will submit plain text, which seems simple, but in
fact you can't just spit this out as HTML. If user A typed < or >
characters, they might get swallowed by user B's Web browser. Does this
matter? Consider that "<g>" is interpreted in various online
circles as an abbreviation for "grin" but by Netscape Navigator as an
unrecognized (and therefore ignore) HTML tag. Compare the meaning of
"We shouldn't think it unfair that Bill Gates has more wealth than the 100 million poorest Americans combined. After all, he invented the personal computer, the graphical user interface, and the Internet."with
"We shouldn't think it unfair that Bill Gates has more wealth than the 100 million poorest Americans combined. After all, he invented the personal computer, the graphical user interface, and the Internet. <g>"
It would have been easy enough for me to make sure that such characters
never got interpreted as markup. In fact, with AOLserver one can do it
with a single call to the built-in procedure ns_quotehtml.
However, consider the case where a nerd posts some HTML. Other users
would then see
"For more examples of my brilliant thinking and modesty, check out <a href="http://philip.greenspun.com/">my home page</a>."I discovered that the only real solution is to ask the user whether the submission is an HTML fragment or plain text, show the user an approval page where the content may be previewed, and then remember what the user told us in an
html_p column in the database.
Is this data model perfect? Permanent? Absolutely. It will last for at least... Whoa! Wait a minute. I didn't know that Dave Clark was replacing his original Internet Protocol, which the world has been running since around 1980, with IPv6 (http://www.faqs.org/rfcs/rfc2460.html). In the near future, we'll have IP addresses that are 128 bits long. That's 16 bytes, each of which takes two hex characters to represent. So we need 32 characters plus at least 7 more for periods that separate the hex digits. We might also need a couple of characters in front to say "this is a hex representation". Thus our brand new data model in fact has a crippling deficiency. How easy is it to fix? In Oracle:
You won't always get off this easy. Oracle won't let you shrink a column from a maximum of 50 characters to 16, even if no row has a value longer than 16 characters. Oracle also makes it tough to add a column that is constrained
alter table bboard modify (originating_ip varchar(50));
not null.