2-valued domains in database design.

Intro

I have been wanting to put these thoughts in writing for a long time. A recent discussion on the TTM list made me decide to finally take the time to do it.

What follows will concern itself with the use of 2-valued domains in the discipline of database design. The most obvious example of such a domain is, of course, the domain of truth values {true, false}, often named 'Boolean'. Many other domains can be imagined, however, that are also 2-valued, such as GENDER {male, female} or KNOWN_STATE {known unknown}. The running example will, for that reason, be using an abstract domain 2V {1, 2}.

A case of database design equivalence

Consider a relvar R with predicate "Thing §thing_id§ exists and its 2V property is §2V§". A possible value for this relvar might be :

R
THING_ID 2V
BW_POSSIBLESPECIES.SPDB 1
AM4DP_1.SPDB 2
BW_SPOTS.SPDB 1
AM4DP_5.SPDB 2
DATABASECATALOG.SPDB 2
BW_SPECIES.SPDB 1
AM4DP_4.SPDB 2
BW_SPOTSXSPNAME.SPDB 2
AM4DP_3.SPDB 2
AM4DP_2.SPDB 2

It will be clear that horizontal partitioning could be applied to this relvar, with one partition holding only the thing_id's for which 2V is 1, the other partition those for which 2V is 2. Given that domain 2V has only two values, we don't need more partitions to hold the complete original relation, and we can use the restriction operator to compute the two partition values from the value for R :

R1
THING_ID 2V
BW_POSSIBLESPECIES.SPDB 1
BW_SPOTS.SPDB 1
BW_SPECIES.SPDB 1
R2
THING_ID 2V
AM4DP_1.SPDB 2
AM4DP_5.SPDB 2
DATABASECATALOG.SPDB 2
AM4DP_4.SPDB 2
BW_SPOTSXSPNAME.SPDB 2
AM4DP_3.SPDB 2
AM4DP_2.SPDB 2

As an alternative design for R, we could envisage a design that has the two relvars R1 and R2. Their predicates would be essentially the same as for R, but for the fact that they would also have to reflect the effects of the restriction to value 1 resp. 2 :

The simplified versions of the predicates clearly suggest that we can suffice with a unary relvar to represent this information. This is also obvious from looking at the nature of the restriction conditions applied, which leave no room for any 2V value other than the one in the restriction condition to appear in the result. (In normalisation theory, this would be epxressed as an FD {}->{2V} - the determinant for 2V is the empty set - which would suggest we'd have to at least split the R1/R2 relvars, moving their 2V attribute in its own table.) In our running example, that would yield the following spectacular relvars/relations :

R1
THING_ID
BW_POSSIBLESPECIES.SPDB
BW_SPOTS.SPDB
BW_SPECIES.SPDB
R2
THING_ID
AM4DP_1.SPDB
AM4DP_5.SPDB
DATABASECATALOG.SPDB
AM4DP_4.SPDB
BW_SPOTSXSPNAME.SPDB
AM4DP_3.SPDB
AM4DP_2.SPDB

The conclusion is that if a relation schema has an attribute 2V drawing its values from a 2-valued domain, then it is always possible to replace that design with an equivalent one that has two relation schemas (one for each value of the 2-valued domain), and where none of these schemas still have the attribute of the 2-valued domain. The equivalence between the two designs is as follows :

(Aside : obviously, similar observations can be made if the domain has 3 values, or 4, or any other number. The property just seems more appealing and interesting to exploit in the case of 2-valued domains.)

Variation on a theme

Something similar can be observed for domains which are not two-valued, but for which an equivalence relation consisting of two equivalence classes is definable. To begin with a not so appealing example : the domain of integer values can be divided in two classes, the odd ones and the even ones. Or the powers of two and the non-powers of two.

R :
FILENAME ISPOW2 PAGESIZE
AM4DP_3.SPDB True 32768
AM4DP_2.SPDB True 16384
AM4DP_1.SPDB True 8192
BW_SPOTS.SPDB True 16384
BW_POSSIBLESPECIES.SPDB True 16384
AM4DP_5.SPDB True 32768
DATABASECATALOG.SPDB True 32768
AM4DP_4.SPDB True 32768
BW_SPECIES.SPDB True 32768
BW_SPOTSXSPNAME.SPDB False 49152

The very same idea of "horizontal partitioning" illustrated before can be used to partition this relation :

R1 :
FILENAME PAGESIZE
AM4DP_5.SPDB 32768
DATABASECATALOG.SPDB 32768
AM4DP_4.SPDB 32768
BW_SPECIES.SPDB 32768
AM4DP_3.SPDB 32768
AM4DP_2.SPDB 16384
AM4DP_1.SPDB 8192
BW_SPOTS.SPDB 16384
BW_POSSIBLESPECIES.SPDB 16384
R2 :
FILENAME PAGESIZE
BW_SPOTSXSPNAME.SPDB 49152

The example is not so appealing because most likely, the domains of the 'pagesize' attribute in R1 and R2 will be the same. As a consequence, there are no operators/operations that we could or would want to invoke on the 'pagesize' attribute of R1/R2 that we could not invoke on the 'pagesize' attribute of the other. Expression writing is just as easy on R as it is on the R1/R2 alternative. We usually choose the R design, and not the split-up equivalent, because e.g. we can then avoid having to do separate physical design for two distinct relvars, and can suffice with a single one.

The story changes a bit if the domain of attributes like 'pagesize' are a tagged union, that includes special values such as 'UNKNOWN'.

transform(restrict ( extend (dbmsfile , ispow2( le(div(ln(tofloat(pagesize)),ln(float(2.0))),float(15.1)) )) , not(ispow2)),(filename,pgsize(string(UNKNOWN))))

R2 :
PGSIZE FILENAME
UNKNOWN BW_SPOTSXSPNAME.SPDB
R1 :
PGSIZE FILENAME
16384 BW_SPOTS.SPDB
16384 BW_POSSIBLESPECIES.SPDB
32768 AM4DP_5.SPDB
32768 DATABASECATALOG.SPDB
32768 AM4DP_4.SPDB
32768 BW_SPECIES.SPDB
32768 AM4DP_3.SPDB
8192 AM4DP_1.SPDB
16384 AM4DP_2.SPDB

The single-relation equivalent being :

R :
FILENAME PGSIZE
BW_SPOTS.SPDB 16384
BW_POSSIBLESPECIES.SPDB 16384
AM4DP_5.SPDB 32768
DATABASECATALOG.SPDB 32768
AM4DP_4.SPDB 32768
BW_SPECIES.SPDB 32768
BW_SPOTSXSPNAME.SPDB UNKNOWN
AM4DP_3.SPDB 32768
AM4DP_1.SPDB 8192
AM4DP_2.SPDB 16384

Now, it looks like expression writing on R for doing computations on the values in the 'pagesize' attribute, gets more tedious, because we have to explicitly exclude the tuples with "UNKNOWN" from the relation we process.

Typically, that problem will be addressed by defining R1/R2 as views on R, such that the overall database definition has, not one, but THREE relvars. I believe that "keeping the number of relvars low/manageable", is exactly the opposite of what will be achieved with one-relvar-tagged-union-domains in practice, if we include the virtual relvars in that count. (Which is reasonable because they are intended for using thus the user must remember their names.)

And if you don't define those views, the user is left with all that pesky excluding-the-special-values all over the place.

It may well be the case that there is no significant difference between the options of single-relvar-with-tagged-unions-attributes and the split-up version, because in the former case the "split-up" relvars will most likely be defined as views anyway, and in the latter case the UNION will be defined as a view for presentation purposes anyway. And the database equivalence between the two options is mathematically provable.