Forum OpenACS Development: Why unique constraints should be treated with care

Within the OCT i started a conversation about the sense of using a unique constraint on "e-mail".

I know from the history of OpenACS that the system relied on unique e-mail address for login purposes, but now the system is expanding and is hindered by this constraint e.g. for a CRM setting. My experience has shown and hopefully others can jump in as well, that for CRM purposes e-mails are not unique. This is why most sites today ask for a username and not an e-mail because other systems already detected the fact that e-mails are not unique.

Example:

An aweful lot of companies use “info@companyname” addresses. So some of the employees have the info@... Address and some don’t (have their own one). Still they are separate users in the system.

I faced this situation now in three systems where I had to import data from the client. In the first installation I managed to get away with not adding the e-mail to the second person but state “mailto:other_party_id@foo.bar”. But in the other two setups we had around 4000 doublicates each. And sadly not always for the same company (otherwise I could have tricked the system and store the e-mail with the organization_id and rewrite any application to check for the e-mail of the organization if there is none for the party (which we actually do anyway)).

You could then argue that this is not a perfect scenario and the clients should clean up their data (which they will do eventually), but if some other users are going to use contacts to import customer CRM data they are in for a big surprise with the unique constraint on

e-mail
organization name

Did I mention unique constraint on organization name is a really bad idea? I know, in a perfect world a company name should only exist once. But guess what. World isn’t perfect :). Anyway, this is just another example where a developer has created a perfect setting not reflecting the realities in live. Sorry for getting started on this, obviously we dropped the unique constraint in organizations. Let the users find out which organization they want if they have multiple ones with the same name (which is easy to do if you use contacts as usually there is a second company_name or an address which helps differentiating).

Therefore I'd encourage developers to be *really* careful with unique constraints and make a check if this unique constraint is reflected in the real world. Usually it isn't and you should never code a unique constraint just because it makes your live easier.

Collapse
Posted by Jon Griffin on
Malte,
I have brought this up for years and even had a patch in a 4.x series. I am surprised that it is only now being looked at again.
I talked at length with Don B and made the easy patch, and IIRC we didn't add it because at the time, PG didn't allow alter columns or some such.

http://jongriffin.com/static/openacs/no-email/no-email

Collapse
Posted by Malte Sussdorff on
A lot of discussion has been going on on the OCT mailing list. I would like Dave and Don to comment here as well, the main point is:

- Overwrite in your custom package the party::get_id_by_email function as you obviously do with the altering of the table.

As we can do this in general I will not bother further (aka write a TIP) as the feeling from the OCT is at the moment that it is opposed to the idea.

OTOH as this is a required feature by many (as it seems), maybe we should write a package called "zz-email-not-unique" so others can install it to work around the custom behaviour in OpenACS.

Collapse
Posted by Dave Bauer on
Malte,

I don't have an opinion either way, I just wanted the change, if there is one to go through the proper process. Thanks for bringing this up for discussion.

On email, I think requiring unique makes sense since we use the email to actually _send_ email out. So in your case where several people use "mailto:info@company.com"; it makes sense to leave email blank instead. It makes no senese to use email at all like that. In a community system where family members share an email, I don't know the best way to support that, that makes sense. I guess you'd risk duplicate emails to the same address and there isn't much you can do about it.

such group mail accounts are pretty common, such as secretary@institute, webmaster@somehost, bugs@oacs, current-person-in-charge@office, etc. It is not a good idea to use such email addresses as "personal email addresses" , but to provide email addresses for groups/communities, where the people in question can be added to, such that incoming emails are distributed accordingly. It would be actually quite convenient to to use dotrln communities as a mail reflector...

in your example, make a group/community info, subscribe the people in question to it, and set the reply-to address for outgoing mails. doesn't do forums something similar already?

just my 2 cents...

Collapse
Posted by Don Baccus on
"This is why most sites today ask for a username and not an e-mail because other systems already detected the fact that e-mails are not unique."

This is an unfounded statement from authority, Malte. Hand-waving just does little to convince me.

"Therefore I'd encourage developers to be *really* careful with unique constraints and make a check if this unique constraint is reflected in the real world. Usually it isn't"

This is just as bad. Nothing but hand-waving.

Now ... you still haven't answered one of my original questions. If two people have the same e-mail and the system's set up to allow login by e-mail, which person do you log in if the e-mail's not unique?

Flip a coin?

The spam issue mentioned by Dave is one I thought of driving to the bay area today.

If we remove the constraint and things break, who will fix it? You? History doesn't back up this presumption.

Sorry if cleaning up the Oracle messes on the floor that delayed 5.2 so long has left a sour taste in my mouth, but why shouldn't it?

Disclaimer: This discussion has been going on way to long and it has already turned nasty to the point were I just gave up discussing it any further. But as Don made comments in a dissmissive way about my believes and work in public I feel personally offendend and want to ractify. I just hope that this does not result in a flame war as past experience has at least shown to me that this kind of thing is bad for the community, especially if happening between the founder and one of the most active members.

Now let's get started 😊

Let's put it this way. I have three clients with user data of each around 20.000 customers with 40.000 contact people. Within these I find that we have around 1000 contacts who are using the same e-mail address. These are distinct people (read: parties).

The unique constraint on emails is on *parties*, which is a table, if I'm not mistaken, that has *nothing* to do with users, because groups and organizations could be parties as well. So, to answer your question: If two *PARTIES* have the same e-mail, I use *username* in my system to log them in. Which is my whole point. Do not use email for the login of *users* but username. Username is part of the users table, emails is part of the parties table. There are more parties than users in the system, so it does not make sense to limit the parties in my opinion due to the fact that you need a uniqueness on login, which you are giving to real person (aka users).

As for the Oracle delay, would you mind publicly stating which CVS commits from myself you refer to with regards to fixing Oracle? You keep on telling that I'm responsible for delaying the 5.2 release when I thought we established that the code change, though made after the code freeze and therefore a mistake on my behalf which I already apologized for, has not caused any delays in terms of broken code.

Taking that my memory might be bad in that the previous point I just wanted to make that the reason for this exchange (a three line code change...) has actually the capabilities of the dire consequences of breaking code which I wont be able to fix.

http://xarg.net/tools/cvs/change-set-details?key=22705

So, maybe I am exaggerating a little bit (which is different from hand waving, unless my command of the english language is not good enough). But I stand by my point that we have more unique constraints than we need.

E-Mail is not unique. There are families who use the same email, there are employees who use the same email, there are villages who use the same e-mail. Therefore having a unique constraint on e-mails while the open world actually prooves that more parties (organizations and persons) use the same e-amil is a bad idea. If you force users (note the difference between person and user) to use their e-mail address for login, obviously the unique constraint on e-mail makes sense. I'd even go so far and agree that the email of a user (person which has a password to login to the community) in the system should be unique, to avoid the spam problem, though I still wouldn't enforce it (why do we have the username).

Oh, and while the box of pandora is already open who says that users in the community system have to have an e-mail? If I want to participate in the forums and make assessments and have my dotlrn portal, I do not need an e-mail address. I need a login, and the email makes it easier to communicate with myself, but why should I be forced to provide the email or, for that matter, why should any user of OpenACS force his users to provide e-mails (this refers to "not null" constraints).

Another unique constraint which we had been banging our head against quite often in the last months is the unique constraint on cr_item name with parent_id, though I still agree this is a good idea.

Here is the szenario though:

- Original file is uploaded
- User uploads a first translation with the same filename (e.g french)
- User two uploads a second translation with the same filename (e.g. spanish)

As we do not have a unique constraint on cr_item, locale, parent_id we are forced to either

- Use separate folders for the files (of each languague)
- Rename the name of the file

There have been some other issues (multiple revisions of a file where you need to know which revision was following what earlier revision, resulting in us not using revisions in the first place but different folders).

Note: I'm not accusing anyone of making a wrong decision (even if I say that it has been a bad idea), I just note that realities of day to day development driven by client needs has shown that some assumptions should better not have been made.

So my warning / plea is solemnly: If you make a unique or not null constraint in the datamodell, make sure that your assumption is correct and can be remodelled in the real world, as changes in the datamodell and application once you made this assumption, are painful, result in discussions like this and most likely wont be made in the first place or not committed back, resulting in forks all over the place.

Collapse
Posted by Torben Brosten on
Malte, the "zz-email-not-unique" seems like a nice dotlrn-way of handling the change.

It is not really necessary to understand how people collectively manage a resource, such as one email address --just that it happens all the time all over the world. Why do we need to know if people use 2+ clients for 1 email, or a web interface, filters and maybe a little cooperation in manually managing emails?

OpenACS should consider it part of the behavioral norm as much as any other community user variation.

Constraints for corporate/legal/privacy reasons does not hold. The first person who registers to an OpenACS site using a shared email results in similar circumstances to when multiple parties register with the same email. In each case, an email potentially destined to one person may be read by more than one. How can the site owners be responsible for that?

Certainly ecommerce should have strict requirements. But might it be reasonable to assume that those sharing an email might also be sharing other private info?

How to enforce privacy when multiple parties share the same email?

Use a web-based intra-OpenACS community email package to convey email subject/content, and send a generic notification to the user's email address. I believe a package like that is in development.

Maybe the zz-email-not-unique code could be added to the intra-site-messaging package.

Sites that want to keep the constraint might be satisfied with by just creating a parameter somewhere that automatically sets username to email_address on registration etc.

Collapse
Posted by Torben Brosten on
How can "the people" subscribe to a community if they all use the same email address and there is constraint on users that their email addresses must be unique?
Collapse
Posted by Don Baccus on
"So my warning / plea is solemnly: If you make a unique or not null constraint in the datamodell, make sure that your assumption is correct and can be remodelled in the real world, as changes in the datamodell and application once you made this assumption, are painful, result in discussions like this and most likely wont be made in the first place or not committed back, resulting in forks all over the place."

At the risk of offending Malte even further, I might suggest that he follow his own advice and to be even more cautious when removing or changing fundamental aspects of the datamodel and code base that has served so many people so well for so many years.

I'm not opposed to well thought-out changes However willy-nilly changes that are made because of a one-off problem with a client, without public discussion of the problem and possible solutions, are a problem. And such changes should not be made unless the person proposing and making them are willing to spend the time to track down potential problems and to make the commitment to fix them afterwards.

Collapse
Posted by Don Baccus on
So if you have this case where a bunch of users have "info@company" as their e-mail rather than individual e-mail ... is there any particular reason why you can't simply set their e-mail to "null"? They can still log in with their username in this instance, right?

After all, if "mailto:oct@openacs.org"; were to be my e-mail address in an OpenACS instance, I would be unhappy if I were to get 10 copies of every alert sent out for a forum post.

Wouldn't you?

ISTM that while getting rid of the unique constraint might be the SIMPLEST solution to your problem, it can hardly be considered the BEST solution. Your solution might make the person who has to read "info@company" unhappy if there are 100s of employees making use of it as their "token" e-mail.

The datamodel doesn't presume that every person will have a unique e-mail address. It assumes that shared e-mails form a group, a party, that (say) "info@company" belongs to the company itself, not any individual.

Perhaps this isn't practical but it does allow modelling of real world scenarios. The datamodel allows for group as well as individual e-mails, in other words. The toolkit doesn't support it well out of the box, and perhaps the datamodel might be more awkward than you'd prefer, but you are WRONG to flatly claim that the datamodel doesn't support this real-world scenario.

As far as process goes, Malte, my issue here is that there's a tendency to commit changes that avoid roadblocks that follow the path of least resistance - the simplest change to solve your personal problem - rather than raise flags and ask for discussion.

If you had approached this by first posting "hey, I've got a problem, any ideas on how to best solve it?" my reaction would be a bit different.

Likewise the addition of the additional table after code freeze. I am in no way convinced that you and/or Timo have solved the problem in the best way, but given that this core datamodel change was made at (past) the last minute without any prior discussion was not even been given the chance to think about the problem.

If it were just me it would be no problem, but given Daveb's reaction (a man whose life seems to be spelled "IRC") it seems pretty clear that changes of this sort are being made without discussion with anyone.

That's not good. The plea for public discussion before such changes are made rather than afterwards when folks read your commit message isn't meant to make it painful for you to make changes. Rather it is meant to make life less painful for us as a community by discussing fundamental issues first, while making changes later.

The data modell does not support the real world example that an organization can have the same e-mail address as an individual within this organization at the same time. Due to this constraint all applications are forced to write some code which allows them to find the best e-mail address to use. In contacts we do this for employees by using the nasty hack of saying "if e-mail empty, check for employer e-mail and use this one". Though this spells trouble for the users who work at two different organizations but have a prefered organization e-mail. Dropping the unique constraint allowed us to do exactly this.

But anyway, I will not bring up a TIP but use this in our own patch which we just don't commit (as this has proven the best way for other companies who have the need for changes and don't have the luxury of lengthy debates). Which should solve the process problem from our side as, if we accidently make a commit which you feel should be discussed, we immediately remove it and put it into a patch. No big deal. If you feel this puts an unnecessary burden on the community (as obviously I disagree with your interpretation of the TIP rules and therefore am prone for running into this conflict again and again), feel free to ask us not to commit anything to core anymore, then we at least know where we stand.

As you mentioned the issue with application data links, feel free to start a discussion on that and we will make our point why we did it this way (as we did earlier, but I don't think that a list of IRC logs and forums posting links would change your perception). If you don't like then we will just put it into a seperate package as it would be totally unfair if we put this into a patch as nearly all packages we have been working on rely on that API [yeah, I know, very *evil* and sneaky behaviour of making a core change and making sure that removing it will render 5 packages useless 😊].

Collapse
Posted by Jon Griffin on
The problem is really that parties has email, where at the least it should also be in users. You can have it as null as it is now, though that isn't a great solution either. Then you can send email to a user as opposed to a party (assuming you also keep the one in parties).

By having email in parties, you have implied that a party is an entity with email (what about groups which use parties for their relations?)

This really isn't a difficult patch. I already have it. The biggest problem is making sure that existing code that may depend on this faulty assumption about parties having email don't break.

For the record, I don't volunteer to do this, but I at one point had a site that required anon logins and optionally set an email if the user wanted it. That is where this came from, multiple entities with the same email are a similar datamodel issue, but not the motivation for my patch.

--Dave,
RE: sending email. If the model was changed and in addition to email you could use handle as a login, if the people don't put in the right email, what can you do. This is still an issue today. Also, if it is null, you can't login, so you would have to hack the pages to allow handle as an auth type. You can fairly easily stop multiple emails from going out, but unless you set verify email, I can make up email logins all day. How many mailto:bill@microsoft.com or similar do you have in your sites.

That is the real reason that most sites now do not require email