Monday, February 20, 2012

Null Foreign Key

I'm having an argument with an incumbent self-taught programmer that
it is OK to use null foreign keys in database design.

My take is the whole point of a foreign key is that it's not supposed
to be optional, it's very definition is it's a necessary link to the
parent table and part of the definition. If it's optional it shouldn't
be part of the definition of a table and should be in a linking table
instead.

Comments?In general it's good practice to keep NULLs to a minimum in any database
schema. However there's nothing fundamentally wrong with having a nullable
Foreign Key constraint. One example where it's commonly required is an
adjacency list hierarchy such as a table of employees and their managers:

CREATE TABLE Employees (emp_nbr INTEGER NOT NULL PRIMARY KEY,
manager_emp_nbr INTEGER NULL REFERENCES Employees(emp_nbr), emp_name
VARCHAR(30) NOT NULL UNIQUE);

manager_emp_nbr is the FK but it will be NULL for the President, Chief Exec
or other individuals at the root of the tree.

--
David Portas
SQL Server MVP
--|||Bodza Bodza (bodzaxx@.hotmail.com) writes:
> I'm having an argument with an incumbent self-taught programmer that
> it is OK to use null foreign keys in database design.
> My take is the whole point of a foreign key is that it's not supposed
> to be optional, it's very definition is it's a necessary link to the
> parent table and part of the definition. If it's optional it shouldn't
> be part of the definition of a table and should be in a linking table
> instead.

I agree with your programmer. There are those who hold the ideal that
there should be no nullable columns in database, but if you move all
optional columns into linking tables, you will drown in tables.

Say that you in your business process you have an Order. An Order has a
customer - he is probably known from the beginning. There is an employee
who is responsible for assembling the order - but he is not known until
the order is assigned to someone. The Order is later shipped, and added
to a shipment, and thus there is a Shipment ID. At the same time there
is an invoice ID. Yet, the customer pays, and now you have a payment ID.
If you insist on linking tables we are now up to four. Since it would
be an error to have a payment ID, if there is not an invoice ID, you need
enforce that restriction with a trigger, rather with a CHECK constraint
on the table.

Of course if we are talking about the OrderDetails table, the OrderID
in this table should be NOT NULL - but that is because OrderID is
part of the primary key of that table.

--
Erland Sommarskog, SQL Server MVP, sommar@.algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||"David Portas" wrote:

> manager_emp_nbr is the FK but it will be NULL for the President, Chief Exec
> or other individuals at the root of the tree.

Hmmm. Personally in this example I would say that the individuals at
the root of the tree are their own managers. Give me a better example
to convince me.|||Erland Sommarskog <sommar@.algonet.se> wrote:

> I agree with your programmer. There are those who hold the ideal that
> there should be no nullable columns in database, but if you move all
> optional columns into linking tables, you will drown in tables.

I take it you prefer denormalization to normalization then?

> If you insist on linking tables we are now up to four.

And that would be a problem because?...

Your point about not knowing the employee key at the time of the order
is like modelling someone's current state of health: should the
current state of health have a nullable DIEDON key because part of the
life process is to die at a later date? No way Jose, the DIEDON key
should be in a separate table. Likewise if you can't assign the
foreign key to the table at the time, then it's NOT a foriegn key: if
you insist a nullable column should be in the table, then it's just
not good practice to allow it to be nullable. Otherwise, why not just
forget about referential integrity alltogether and just enforce
"integrity" in the application. Sorry I don't buy it.

> Of course if we are talking about the OrderDetails table, the OrderID
> in this table should be NOT NULL - but that is because OrderID is
> part of the primary key of that table.

Right: In 99% of the cases foreign keys *are* derived from primary
keys which are by definition NOT NULL. In the case where you are
deriving a foreign key from simply a unique column then, sure you can
have a null in the unique column, but that's the same as entering a
dummy value into the unique column to reference it. Sorry, but I've
been designing databases since SQL 6.0 and to my eyes your
rationalization is ugly as hell.|||Bodza Bodza (bodzaxx@.hotmail.com) writes:
> Erland Sommarskog <sommar@.algonet.se> wrote:
>> I agree with your programmer. There are those who hold the ideal that
>> there should be no nullable columns in database, but if you move all
>> optional columns into linking tables, you will drown in tables.
> I take it you prefer denormalization to normalization then?

No, that is a forgone conclusion. My approach to data modelling is
foremost practical. Denormalizing by duplicating columns can lead to
performance issues. On the other hand, keeping aggregated tables with
the sum of transactions is almost a necessity in the system I work with.

>> If you insist on linking tables we are now up to four.
> And that would be a problem because?...

Because the logical information is scattered over four tables, which
makes it more difficult for the programmer to understand and use, and makes
it more difficult in support situations to view the data. I don't
know how big your data models are, but the system I work have over
400 tables, and clarity for the developers is important.

> Your point about not knowing the employee key at the time of the order
> is like modelling someone's current state of health: should the
> current state of health have a nullable DIEDON key because part of the
> life process is to die at a later date? No way Jose, the DIEDON key
> should be in a separate table.

Dates are a good examples of why should not split into several tables.
In our database there is a table for sendouts. It looks something like
this:

CREATE TABLE sendouts (accountid int NOT NULL,
documentid int NOT NULL,
addressid int NOT NULL,
fromdate datetime NULL,
enddate datetime NULL,
CONSTRAINT pk_sendouts PRIMARY KEY (accountid, documentid, addressid),
CONSTRAINT ckc_dates CHECK (NOT (enddate < fromdate)),
CONSTRAINT fk_accountid FOREIGN KEY (accountid)
REFERENCES accounts (accountid),
CONSTRAINT fk_documentid FOREIGN KEY (documentid)
REFERENCES documents (documentid),
CONSTRAINT fk_addressid FOREIGN KEY (addressid)
REFERENCES addresses (aaddressid)
)

An account may have one or more sendounts, typically at least two different
documents: contract note and account statment. A document can be sent to
more than one address for the same account.

Most commonly a sendout applies as long as the information is there,
in which case both fromdate and enddate are NULL. But an account
owner may want extra copies to his summer house for a limited period.
Or an account owner may be under age, and copies are to be sent to his
guardian until his 18th birthday. Thus, in this case fromdate is
NULL, but enddate has a value.

So, if we don't want any NOT NULL columns, we would need one extra
table for each column. And that would be two four-column tables, unless
we introduce a surrogate key for sendouts.

And how would we use this table? Well, if we want access the information
to display in the maintenance form for sendouts, the most natural would
be to have one SELECT, and join the two subtables with LEFT JOIN. In
which case the NULLs reappears. Of course, you could have one stored
procedures per table, but that would be ineffecient, and you would
leave it to the client to perform the join to reassociate the data. And
the client still have to keep track of a concept of "not applicable".

So instead of bending over backwards to avoid NULL in the database,
it appears a lot simpler to have the columns in one place from the
beginning. After all, the NULLs are there already in the business rules.

> Likewise if you can't assign the foreign key to the table at the time,
> then it's NOT a foriegn key: if you insist a nullable column should be
> in the table, then it's just not good practice to allow it to be
> nullable. Otherwise, why not just forget about referential integrity
> alltogether and just enforce "integrity" in the application.

Referential integrity and nullability are orthogonal issues to each
other. NOT NULL means that there must be a known value. Referential
integrity means that if an attribute has a value, it must be defined
in another table. None of the concepts imply the other.

>> Of course if we are talking about the OrderDetails table, the OrderID
>> in this table should be NOT NULL - but that is because OrderID is
>> part of the primary key of that table.
> Right: In 99% of the cases foreign keys *are* derived from primary
> keys which are by definition NOT NULL. In the case where you are
> deriving a foreign key from simply a unique column then, sure you can
> have a null in the unique column, but that's the same as entering a
> dummy value into the unique column to reference it.

But you are mixing apples and oranges. OrderID in OrderDetails is NOT NULL
because it's part of the primary key in *OrderDetails*. The ProductID
in OrderDetails is NOT NULL, because business rules says that an order
detail is about a product. But CampaignID is nullable, because not all
order items are bought because of an offer from a campaign.

> Sorry, but I've been designing databases since SQL 6.0 and to my eyes
> your rationalization is ugly as hell.

Of course, database design is an area where you often can do things in
more than one way, and it's up to subjective opinion which you prefer.
However, the model you seem to advocate, when take to its extreme, would
lead to data models that are difficult to use, and not always that
effecient.

--
Erland Sommarskog, SQL Server MVP, sommar@.algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Erland Sommarskog <sommar@.algonet.se> wrote:

> No, that is a forgone conclusion. My approach to data modelling is
> foremost practical. Denormalizing by duplicating columns can lead to
> performance issues.

Agreed, it is also in some cases practical, especially if reporting.

>On the other hand, keeping aggregated tables with
> the sum of transactions is almost a necessity in the system I work with.

Are you using a data warehouse or is your system primarilly used for
reporting, by any chance?

> >> If you insist on linking tables we are now up to four.
> > And that would be a problem because?...
> Because the logical information is scattered over four tables,

Which increases the speed of updates because there is less data and
thus increases concurrency, which in my case is key because I want to
handle thousands of users.

>which makes it more difficult for the programmer to understand and
use,

Please! It's not a good reason to denormalize table because the
programmers are too dumb or too sloppy to understand! Send them on a
training course or train them yourself. Come on!

>and makes it more difficult in support situations to view the data.

So write views to simplify...

>I don't
> know how big your data models are, but the system I work have over
> 400 tables, and clarity for the developers is important.

It doesn't matter the size of the data-model, I've written systems
with 500 tables and I've written systems with 20 cases. In neither of
the two extremes would I use null foreign keys or cut corners because
of "clarity". And I take issue with your use of the phrase "clarity"
when badly designing a database.
Sorry, I don't buy the argument that you should cut corners in the
database model because your developers don't understand.

> So, if we don't want any NOT NULL columns, we would need one extra
> table for each column. And that would be two four-column tables, unless
> we introduce a surrogate key for sendouts.

I have no argument with nulls, just with null "foreign" keys.
You want to have null end-dates, be my guest. Having optional foreign
keys just because you don't know you can put them in a linked table is
ugly in my opinion.

> So instead of bending over backwards to avoid NULL in the database,
> it appears a lot simpler to have the columns in one place from the
> beginning. After all, the NULLs are there already in the business rules.

Agreed. I *have* nulls in my designs. Just not in foreign keys.

> Referential integrity and nullability are orthogonal issues to each
> other. NOT NULL means that there must be a known value. Referential
> integrity means that if an attribute has a value, it must be defined
> in another table. None of the concepts imply the other.

Agreed: You *can* have a foreign key which references a UNIQUE NULL
column in a foreign table which could have a single null row in it,
but what's the point of that? That's the same as creating a dummy
value which means "I don't know". My argument is that the whole point
of a foreign key is that it should be drawn from a KNOWN list. If it
can be unknown then you shouldn't call it a "foreign key".

> But you are mixing apples and oranges.

No, I'm not: You think I am because you have missed my point. Let me
spell it out for you; Read the title of the thread again. It doesn't
say "nulls are ugly". IT says "null foreign key".

>OrderID in OrderDetails is NOT NULL
> because it's part of the primary key in *OrderDetails*. The ProductID
> in OrderDetails is NOT NULL, because business rules says that an order
> detail is about a product. But CampaignID is nullable, because not all
> order items are bought because of an offer from a campaign.

But if you are going to later lookup CampaignID afterwards from a list
of known values then there's a pretty good argument for moving the
campaignID out into OrderDetailsCampaign when

CREATE TABLE OrderDetailsCampaign(
OrderDetailID int NULL,
CampaignID int NOT NULL,
FOREIGN KEY (CampaignID) REFERENCES (Campaign),
PRIMARY KEY (OrderDetailID))

Sorry, this is just cleaner in my opinion.

> Of course, database design is an area where you often can do things in
> more than one way, and it's up to subjective opinion which you prefer.

Except that in some cases (e.g. null foreign keys) it's not
subjective.

> However, the model you seem to advocate, when take to its extreme, would
> lead to data models that are difficult to use, and not always that
> effecient.

No, on the contrary. A well normalized database is only difficult to
use if you don't understand relational databases, moreover
normalization leads to greater concurrency and can handle many more
users for a variety of reasons. Look them up if you're unaware.

On the other hand if you are looking for greater query speed you can
denormalize the database to your hearts content and index the hell out
of it. *Although* if you do this, it is at the expense of concurrency
for those who actually want to insert or update data. In the best of
all worlds, you have a normalized design for the OLTP database and a
denormalized design FED FROM the OLTP database for reporting.|||Bodza Bodza (bodzaxx@.hotmail.com) writes:
> Are you using a data warehouse or is your system primarilly used for
> reporting, by any chance?

No. The system I work with is mainly OLTP.

>> Because the logical information is scattered over four tables,
> Which increases the speed of updates because there is less data and
> thus increases concurrency, which in my case is key because I want to
> handle thousands of users.

Huh? Instead of inserting/updating one row in one table, you need to
access four rows in four different tables. This means that you need
a user-defined transaction, and hold locks on the first table for a
longer time. Note that in the case of an update, you may need to delete
rows from the subtable if that attribute is set to NULL.

>>which makes it more difficult for the programmer to understand and
> use,
> Please! It's not a good reason to denormalize table because the
> programmers are too dumb or too sloppy to understand! Send them on a
> training course or train them yourself. Come on!

There is no training course that can teach them our data model. Which is
complicated enough already. I try to document as good as I can, but in
our small company, I have to rely on that people can work independently.

If a column is in the main table, it is visible, and you may realise
that you need to handle it in some way. If it is in a subtable, it may be
overlooked and forgotten.

And, by the way, this has nothing to with denormalization. Denormalization
is when you introduce redundancy - and this is not about redundancy.

>>and makes it more difficult in support situations to view the data.
> So write views to simplify...

And the views will have foreign keys in them that can be NULL. What's
the point?

> It doesn't matter the size of the data-model, I've written systems
> with 500 tables and I've written systems with 20 cases. In neither of
> the two extremes would I use null foreign keys or cut corners because
> of "clarity".

If I were to follow your advice and put all my nullable foreign-key
columns in sepearate tables, I would probably have over 1000 tables.
Many with one single column beside the primary key.

> I have no argument with nulls, just with null "foreign" keys.

But there is nothing special with foreign-key columns. A NULL in any
column means that the value is not applicable or unknown. Let's say
that I have:

CREATE TABLE whatever (someid int NOT NULL PRIMARY KEY,
...
error_reason char(1) NULL,
CHECK (error_reason IN ('A', 'B', 'C', 'D'))

Then one day I get tired of changing the table when a new possible
value for error_reasons is invented, so I decide to make a table
error_reasons, and change whatever to:

CREATE TABLE whatever (someid int NOT NULL PRIMARY KEY,
...
error_reason char(1) NULL,
REFERENCES error_reasons(error_reasons)

There is no logic why I suddently should move this column to a new table,
just because the constraint on the column is now a foreign-key constraint
and not a check constraint.

As I said in my previous post, nullability and foreign-keyness are
orthogonal issues.

> Agreed: You *can* have a foreign key which references a UNIQUE NULL
> column in a foreign table which could have a single null row in it,
> but what's the point of that?

Particularly, since if a column is NULL, the constraint is not checked
anyway.

> My argument is that the whole point of a foreign key is that it should
> be drawn from a KNOWN list. If it can be unknown then you shouldn't call
> it a "foreign key".

And I am saying that you are wrong. A foreign-key column is just another
attribute of the object described by the table. The only difference is
that this particular attribute is a pointer to another object. But there
is no reason why a pointer should be different in nullability and any
other attribute.

> But if you are going to later lookup CampaignID afterwards from a list
> of known values then there's a pretty good argument for moving the
> campaignID out into OrderDetailsCampaign when

I am sorry, but I can find absolutely weight in this argument. There is
no problem with looking up the campaignid, even if some values are NULL.
Ah, you have to know how to to an outer join - but so you need to with
your way of doing anyway.

> No, on the contrary. A well normalized database is only difficult to
> use if you don't understand relational databases, moreover
> normalization leads to greater concurrency and can handle many more
> users for a variety of reasons. Look them up if you're unaware.

I sincerly doubt that you gain concurrency by splitting up data for an
objects in more tables. As I said, your transactions will be longer. And,
if your programmers choose different access orders, you increase the
risk for deadlock.
--
Erland Sommarskog, SQL Server MVP, sommar@.algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Erland Sommarskog <sommar@.algonet.se> wrote:

> > [normalization]increases the speed of updates because there is less data and
> > thus increases concurrency, which in my case is key because I want to
> > handle thousands of users.
> Huh? Instead of inserting/updating one row in one table, you need to
> access four rows in four different tables.

Four smaller narrower tables. That is the whole point which you seem
to have missed.

> If a column is in the main table, it is visible, and you may realise
> that you need to handle it in some way. If it is in a subtable, it may be
> overlooked and forgotten.

Yuck.

> And, by the way, this has nothing to with denormalization. Denormalization
> is when you introduce redundancy - and this is not about redundancy.

No you're right, it has to do with sloppy database design. Your reason
is bogus.

> If I were to follow your advice and put all my nullable foreign-key
> columns in sepearate tables, I would probably have over 1000 tables.
> Many with one single column beside the primary key.

Which is the sign of a well normalized database. Lots of small narrow
tables.

> The only difference is
> that this particular attribute is a pointer to another object.

This is the crux of my disagreement. A foreign key is a relationship
in a database which says (this must be a value which is in an existing
list).
It's NOT a pointer.

> I sincerly doubt that you gain concurrency by splitting up data for an
> objects in more tables.

Oh really? The database has way LESS data to table scan and you say it
will take more time to search?
I suggest you do some reading on normalization and concurrency.|||Bodza Bodza (bodzaxx@.hotmail.com) writes:
> Erland Sommarskog <sommar@.algonet.se> wrote:
>> Huh? Instead of inserting/updating one row in one table, you need to
>> access four rows in four different tables.
> Four smaller narrower tables. That is the whole point which you seem
> to have missed.

It is correct so far that fixed-width columns take up the same space,
no matter they are NULL or not. Therefore, if you have a couple of columns
that are NULL for most rows, it may be a good idea to move these to a
side table, as this can reduce the row size of the main table, and make
it faster to access. But while foreign-key columns often are fixed-width,
it not really much to do about foreign keys as such, but about nullable
columns.

And more importantly, you are not saving a lot of space, by moving out
nullable columns are that are most often non-NULL to side tables, least of
all one per column. This means that for the same row, the primary key value
may be duplicated in four tables. For a multi-column PK, this cost
can quickly outrun what you gain by not using any space to save the NULL
values.

>> If a column is in the main table, it is visible, and you may realise
>> that you need to handle it in some way. If it is in a subtable, it may be
>> overlooked and forgotten.
> Yuck.

Do you know about cognitive science? One of best friends got her Ph.D.
in this subject, and while I have not studied it, it really a useful
subject. By understanding how people understand things, you can make
things simpler. This may apply to control boards for lights in a conference
room, a GUI and even a data model.

>> And, by the way, this has nothing to with denormalization.
>> Denormalization is when you introduce redundancy - and this is not
>> about redundancy.
> No you're right, it has to do with sloppy database design. Your reason
> is bogus.

As I said, whether to do it this way or that way, is a matter of
subjective opinion. After all, you came to this newsgroup to ask for
advice, and you got it, even if the answer was not the one you expected.

>> The only difference is
>> that this particular attribute is a pointer to another object.
> This is the crux of my disagreement. A foreign key is a relationship
> in a database which says (this must be a value which is in an existing
> list).
> It's NOT a pointer.

It must be a value in an existing list when it is non-NULL. And this is
what you fail to understand. You seem to think that NULL must refer to
something as well. But it only tells you that this row does not have
this kind of reference.

>> I sincerly doubt that you gain concurrency by splitting up data for an
>> objects in more tables.
> Oh really? The database has way LESS data to table scan and you say it
> will take more time to search?

Depends. As I said above, the total amount of space of the split-up
tables may be larger than the size of a single table.

And, as I pointed out earlier, if you need to insert rows in four
tables instead one, that will take longer time, and must be in the
scope of a transaction. Certainly not good for concurrency. Furthermore,
if different pieces of good uses difference access order, you increase
the risk for deadlocks.

> I suggest you do some reading on normalization and concurrency.

I like to insert the caveat that reading books about database theory
may lead you astray when working with a certain DB engine. What may be
a good thing for performance in Oracle, may be a bad thing in SQL Server
and vice versa. It may even very between SQL Server versions. For instance,
moving only fixed-width columns that are often NULL can be a good idea
in SQL 7 and SQL 2000, but was a meaningless thing to do in SQL 6.5

In the end, if you really want to know which of two alternatives that
are best from performance, you have to benchmark them.

--
Erland Sommarskog, SQL Server MVP, sommar@.algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Erland Sommarskog <sommar@.algonet.se> wrote:
> In the end, if you really want to know which of two alternatives that
> are best from performance, you have to benchmark them.

Agreed. Amen to that.

No comments:

Post a Comment