Although most would not consider omniscience a reasonable requirement for a software system, it’s nevertheless a common complaint that computer systems are not omniscient. It’s not often worded quite that way, but we’re all familiar with the argument that a system (say, a database) must be able to represent a fact that it doesn’t contain, or rather, to represent missing values somehow.
That “somehow” normally takes the form of a special symbol, ω (omega), or “null”. This special symbol is assigned to variables that have not yet been initialized, that have no meaningful value in a given context, whose value is unknown, and so on.
Now, let’s take a step back to reality. In reality (the world that’s actually out there), things either are or are not. Reality doesn’t care if we know about it or not, and there are no facts which aren’t facts about something in particular. And there are certainly no existents out there which are “not yet initialized”.
To be clear, the issue we are talking about pertains to variables in whatever language we’re using. A variable properly has some value, but may have any value (of the appropriate type). The notion of null is that of a variable which has no value at all. It’s my opinion that this amounts to a philosophical error known as the reification of the zero.
Implications of Null
A software system models some abstraction from reality. To model the facts of reality (or a virtual reality), a very simple logical system suffices, since things either are or are not (law of the excluded middle). A simple enough system represents only facts about reality (or, rather, abstractions from such facts), and those things which aren’t facts about reality are simply omitted. This is not to say that it represents all facts of reality (that would be absurd), but only the ones about which we have something to say in the context of our system.
For a simple but contrived example, let’s make a kind of language that represents facts about people’s birthdays. We have a type of expression saying “The birthday of Person p is on Date d.” Call it a logical theorem that there exists some person p whose birthday is on some date d. To prove this theorem, one merely needs to supply some (but any) person who has some (but any) birthday. According to the Curry-Howard isomorphism, we could think of it as a function, a constructor of values of a specific type. Call the type and the constructor function BD, and let it take two arguments, p and d, which represent a person and a date, respectively, yielding values of the form BD(p,d).
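To make this concrete, here is a minimal Haskell sketch of the birthday language. The representations of Person and Date are my own placeholder assumptions; only the shape of BD follows the text.

```haskell
-- Placeholder representations for the sake of the sketch.
newtype Person = Person String deriving (Show, Eq)
newtype Date   = Date String   deriving (Show, Eq)

-- The proposition "the birthday of Person p is on Date d".
-- Under Curry-Howard, any value BD p d is a proof that some person
-- has some birthday.
data BD = BD Person Date deriving (Show, Eq)

johnsBirthday :: BD
johnsBirthday = BD (Person "John") (Date "1980-06-15")
```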
Now comes the question of what to do when a fact is not yet known, not applicable, or can’t be found out (because of some error, for example). There are several options for representation here, of which I will consider two. In a system that tracks people’s birthdays, we could handle unknown birthdays in two ways (both sketched in code after this list):
- State that the fact is unknown, using a new kind of expression: “Person p has an unknown birthday.” If this were a programming language, it would be something like a function UB which takes a value p and returns a value of the form UB(p). It’s like a logical formula with one variable p, such that any value for p provides a proof of the hypothesis that there exists some person whose birthday is unknown.
- Create a special null value called ω, such that for an unknown birthday, we would simply say: “The birthday of person p is on date ω”, or: BD(p, ω), for some p. Or, if the person whose birthday it is is unknown: BD(ω, d), for some d.
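A rough Haskell rendering of the two approaches, with the same placeholder types as above. The second is my own model of what importing ω amounts to, not the syntax of any particular null-admitting language.

```haskell
newtype Person = Person String deriving (Show, Eq)
newtype Date   = Date String   deriving (Show, Eq)

-- Approach 1: "unknown" is stated as just another fact, with its
-- own constructor. Every value of Fact still proves something.
data Fact
  = BD Person Date   -- the birthday of p is on d
  | UB Person        -- p has an unknown birthday
  deriving (Show, Eq)

-- Approach 2: import "null" into the algebra by making every field
-- optional. BDNull Nothing Nothing is now a legal value that
-- asserts nothing about anyone.
data BDNull = BDNull (Maybe Person) (Maybe Date)
  deriving (Show, Eq)
```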
These two approaches differ in a profound way. In the former, we have stayed within the framework of representing “that which is”, i.e. truths only, and used the semantics of the language to represent facts about the unknown (the fact that it is unknown). In the latter, we have imported the notion of the unknown from the meta-level into our algebra. As far as our algebra is concerned, ω is a kind of something that is nothing in particular and has no specific type. It has become a way of formalizing vagueness in our logic. We have in fact changed the logic of our language in such a way that the Curry-Howard isomorphism no longer holds and the law of the excluded middle no longer applies. Supplying null for either of the variables of BD(p,d) creates a value of the form BD(p,d) that is not a proof of the logical formula that BD(p,d) represents. Observe that “The birthday of person John is on date ω” does not constitute a proof of the hypothesis that there exists some person p whose birthday is on some date d.
In short, the difference between these two approaches is that the former treats the fact that something is unknown as just another fact, using grammar that already exists, while the latter is an attempt to formalize “unknown” as a new grammatical construct.
But once we invite null into the system, we have to come up with the semantics for how to handle it. We have to ask the question: “What exactly do we mean by null?” For every operator, we must decide what it means for one of its operands to be null. What does ω = ω yield? In some languages (most imperative programming languages), ω is considered equal to itself. In others, it’s considered an error to even ask the question, resulting in a runtime error or a non-terminating program. In yet others (SQL), such an operation always yields ω itself. This last amounts to a kind of three-valued logic, which is fraught with difficulties. For example, in SQL, A=A is not true for all values of A. In other words, we have arrived at a logic in which the law of identity is not a tautology!
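To see what this does to equality, here is a small Haskell sketch of SQL-style three-valued comparison (my own illustration, modelling a nullable value as Maybe and the third truth value as Nothing):

```haskell
-- SQL-style equality over nullable values: comparing anything with
-- null yields null, never True or False.
eq3 :: Eq a => Maybe a -> Maybe a -> Maybe Bool
eq3 (Just x) (Just y) = Just (x == y)
eq3 _        _        = Nothing   -- any null operand poisons the result

-- eq3 (Just 5) (Just 5)              ==> Just True
-- eq3 (Nothing :: Maybe Int) Nothing ==> Nothing, not Just True
```

That last line is exactly the failure of the law of identity: A = A does not come out true when A is null.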
The implication of this last point is that a Boolean expression in such a system is to be considered valid even though it is neither true nor false. This means we can’t consider such a system to deal with propositions, since all valid propositions are either true or false (a proposition that is neither true nor false is arbitrary, i.e. invalid). From the standpoint of propositional logic, then, a database is corrupt to the extent that it contains nulls.
In fact, once we allow it, we may find that we need more than one kind of null, with differing semantics depending on what we mean by it in that context. One kind may mean “unknown”, another might mean “inapplicable”, and of course we might need a third to say that “we do not know if this is applicable or not”. When taken to its logical conclusion, this leads to an explosion of kinds of null. As a concrete example, consider a database that tracks people’s death dates. For the predicate Person p has a death Date of d, one might need at least three kinds of null: one to say that the person is not yet dead (inapplicable), another to say that the person is dead, but we don’t know when they died (unknown), and a third to say that we don’t know whether or not they have died (unknown whether or not a value is applicable). And this is just one really simple example.
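Of course, once the distinct meanings are named, they can simply become ordinary constructors instead of nulls. A Haskell sketch of the death-date example (the constructor names are my own):

```haskell
newtype Date = Date String deriving (Show, Eq)

-- Each "kind of null" becomes an explicit, honestly named case.
-- No ω required, and no explosion of special symbols.
data DeathDate
  = Alive            -- inapplicable: the person is not yet dead
  | DiedOn Date      -- dead, with a known death date
  | DiedDateUnknown  -- dead, but we don't know when
  | UnknownIfDead    -- we don't know whether a date applies at all
  deriving (Show, Eq)
```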
But in practice, the meaning of null is often decided ad hoc by the user of the system or language, and may not be decided at all, in which case the result of a program employing nulls is, to some degree, consigned to chance.
It should be abundantly clear that introducing null into a language is not much of a solution to anything. We don’t do that with natural language, and certainly not in formal logic, so why do it with programming languages? I believe it’s partially because people are used to thinking of a programming language as a machine rather than a logic. They are still thinking in terms of initializing pointers or registers and transitioning between memory states. But the power of languages is to abstract these things away, so that we can concentrate on the algebra while the machine does the arithmetic. Yet, I think we owe the acceptance of null largely to the fact that people are used to sloppy thinking. They’re used to treating nothing as if it were a thing, and using concepts of which they only know the approximate meaning.
Fortunately, there are real-world solutions for obviating the notion of null. Functional programming, for example, does away with it by employing monads. Haskell has the Maybe monad, which is an abstraction over case analysis such that a value of type Maybe A takes one of two possible forms: Nothing, or Just a where a is a value of type A. This might seem like the same thing as null, but the fact that only certain operations are defined on the Maybe type (i.e. it’s a monad) keeps everything well defined. This concept is applicable to any language that supports a functional style, and can even be implemented in Ruby. In some other languages, this monad is called Option.
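For instance, a lookup that may fail returns a Maybe, and the type forces every caller to handle both cases before a plain value can be used (a small illustrative sketch; the names are mine):

```haskell
import qualified Data.Map as Map

newtype Date = Date String deriving Show

birthdays :: Map.Map String Date
birthdays = Map.fromList [("John", Date "1980-06-15")]

-- Map.lookup returns Maybe Date: the absent case cannot be ignored,
-- and there is no null to dereference by accident.
describe :: String -> String
describe name =
  case Map.lookup name birthdays of
    Just (Date d) -> name ++ " was born on " ++ d
    Nothing       -> "no birthday on record for " ++ name
```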
In the realm of databases, work is being done on more abstract database languages than most of us are used to. Chris Date and Hugh Darwen have been instrumental in putting database systems on the sound theoretical footing of predicate logic, although products based on their research are largely in the early stages of development.
Wonderful reading, thanks for sharing. The “death dates” example is excellent.
> In fact, once we allow it, we may find that we need more than one kind of null, with differing semantics depending on what we mean by it in that context.
True, and easily solved. If you really need to know why a column/variable is null, just add another column/variable with an enumeration of possible reasons.
> True, and easily solved. If you really need to know why a column/variable is null, just add another column/variable with an enumeration of possible reasons.
Reductio ad absurdum: Why not create the entire design using just one column/variable (I prefer “term” in this context) that is always null, and then add other terms with an enumeration of the possible meanings of that one null?
A lot of design problems are easily solved, but it’s easier still to not have those problems in the first place.
Thank you for this condemnation of SQL’s NULL. I particularly like the observation that in SQL’s corruption of logic the law of identity is not a tautology. Only yesterday I had been thinking of making that very same observation in response to somebody who was complaining that NULL isn’t taught in the relational theory part of a certain university’s database course. (I had been a consultant to the team that developed that course, as a revision of a previous course on the same subject, and had urged them not to mention NULL until the part of the course that teaches SQL.)
> Thank you for this condemnation of SQL’s NULL.
No, thank you, Hugh, for shedding light on this for me in the first place. In fact, “the observation that in SQL’s corruption of logic the law of identity is not a tautology” is your observation which I lifted from part 4 of your “Setting The Record Straight” series in response to a certain git.
I absolutely agree that mentioning Null in an introductory course on relational theory would be a mistake. Nay, a crime. I’m glad you were able to keep it out of that course, and I’m certain that it’s better for it. Null is an anti-concept, and as such it corrupts not only databases, but minds.
If anyone thinks that the issue of Null is limited to databases, or that it is a closed issue, they are sorely mistaken. Here is a link to an ongoing thread about the issue of Null in a general-purpose programming language that is quickly becoming mainstream.
Great article that articulates how null can take on different meanings that ought properly to be distinguished and then captured in the types. Conversely, it should be possible to indicate that a tuple field is restricted to taking only non-null values [this sort of thing is normally supported in relational DBs, but only as a field attribute; it is not first class]. Is there any merit in elaborating this line of reasoning to consider whether the concept of discriminated sums [of which Haskell’s Maybe type is an example] in algebraic data types has general application to the relational model and to relational products such as RDBMSs, XQuery or LINQ?
An example I can immediately think of is the PostgreSQL GIS extension, which supports sum-typed geographic representations (point, line, polygon, etc.). I have also run into this issue many times when trying to work out a DB design myself; things would have been much easier if I could have had a field holding elements of a closed set of possible types. I am never sure whether it is my own incomplete understanding of the relational model and the implications of normalization, or whether such a thing would be useful in general. Are you aware of any research or practical work in this direction, or is my understanding incomplete?
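For what it’s worth, that “closed set of possible types” is exactly what an algebraic data type provides; a quick Haskell sketch of the geometry case (the names and representations are illustrative, not PostgreSQL’s):

```haskell
type Coord = (Double, Double)

-- A discriminated sum: a Geometry is exactly one of these shapes,
-- and every function over it must handle each case explicitly.
data Geometry
  = Point Coord
  | Line [Coord]
  | Polygon [Coord]
  deriving (Show, Eq)

describe :: Geometry -> String
describe (Point _)    = "a single location"
describe (Line cs)    = "a path through " ++ show (length cs) ++ " points"
describe (Polygon cs) = "a region with " ++ show (length cs) ++ " vertices"
```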
I highly recommend “Databases, Types, and the Relational Model” by Chris Date and Hugh Darwen.
What I don’t see is why identity and equivalence need to be combined. I agree that if all operations involving both can be simplified to only needing one, then they _should_ be combined. But that doesn’t seem to be the case here.
For example, if I have two bananas (literally, here in my hands), then it is useful to express that they are equivalent but not identical. In contrast, both are owned by me: banana1.owner is identical to banana2.owner, but the objects themselves are distinct in a physical sense.
This leads to the important result that the instantiation of a thing is different from the idea of that thing. The conceptual number 5 is different from me writing ‘5’ on a whiteboard, or tattooing it on my arm. 5=5 is true as an idea, but “five” = “FIVE” is less clear. Just as in math it is helpful to compare things, so too must we compare instances of things, and now it is clear that identity isn’t enough; something like equivalence must exist to help us express these truths.
Now, I completely agree that equivalence is an uglier concept than identity. I’d even say that equivalence is less elegant, less powerful, and more prone to programming errors (its meaning can change, after all). But that still doesn’t mean it isn’t useful or necessary for expressing some truths about the world.
Where do you come down on this? Is there a place for equivalence in logical reasoning? In functional programming?
Thank you for the thought-provoking post!
I was specifically talking about equivalence and identity of data values and programs. These are abstract mathematical entities, not physical existents like bananas. Physical things have distinct identity, as their measurable attributes differ (your bananas will differ in shape, colour, location, etc). But it’s totally meaningless to speak of a difference in the “fiveness” of five bananas and five cars. The number 5 is equal and identical with itself.
That said, when speaking in the abstract, it’s important to establish the context for identity. We often talk of things being identical “up to” or “modulo” some equivalence relation. For example, while there’s no sense in which the list [one, two, three] is not identical to the list [one, two, three], we can say that the list [three, two, one] is identical to the list [two, one, three] “up to list order”. That is, we’re making clear that the relevant concept that establishes identity omits the measurable attribute of order in this context.
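In code, “identical up to list order” is just ordinary equality after forgetting the attribute in question, for example (a quick Haskell illustration):

```haskell
import Data.List (sort)

-- Equality "up to list order": compare after discarding the
-- attribute (order) that the equivalence relation ignores.
eqUpToOrder :: Ord a => [a] -> [a] -> Bool
eqUpToOrder xs ys = sort xs == sort ys

-- eqUpToOrder [3, 2, 1] [2, 1, 3]  ==> True
-- [3, 2, 1] == [2, 1, 3]          ==> False
```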
Great post, but there are a few, well, misconceptions.
See Embrace the unknown: Dealing with nulls in SQL and RDBMSs