2-valued domains in database design. |
I have been wanting to put these thoughts in writing for a long time. A recent discussion on the TTM list made me decide to finally take the time to do it.
What follows will concern itself with the use of 2-valued domains in the discipline of database design. The most obvious example of such a domain is, of course, the domain of truth values {true, false}, often named 'Boolean'. Many other domains can be imagined, however, that are also 2-valued, such as GENDER {male, female} or KNOWN_STATE {known unknown}. The running example will, for that reason, be using an abstract domain 2V {1, 2}.
Consider a relvar R with predicate "Thing §thing_id§ exists and its 2V property is §2V§". A possible value for this relvar might be :
THING_ID | 2V |
---|---|
BW_POSSIBLESPECIES.SPDB | 1 |
AM4DP_1.SPDB | 2 |
BW_SPOTS.SPDB | 1 |
AM4DP_5.SPDB | 2 |
DATABASECATALOG.SPDB | 2 |
BW_SPECIES.SPDB | 1 |
AM4DP_4.SPDB | 2 |
BW_SPOTSXSPNAME.SPDB | 2 |
AM4DP_3.SPDB | 2 |
AM4DP_2.SPDB | 2 |
It will be clear that horizontal partitioning could be applied to this relvar, with one partition holding only the thing_id's for which 2V is 1, the other partition those for which 2V is 2. Given that domain 2V has only two values, we don't need more partitions to hold the complete original relation, and we can use the restriction operator to compute the two partition values from the value for R :
THING_ID | 2V |
---|---|
BW_POSSIBLESPECIES.SPDB | 1 |
BW_SPOTS.SPDB | 1 |
BW_SPECIES.SPDB | 1 |
THING_ID | 2V |
---|---|
AM4DP_1.SPDB | 2 |
AM4DP_5.SPDB | 2 |
DATABASECATALOG.SPDB | 2 |
AM4DP_4.SPDB | 2 |
BW_SPOTSXSPNAME.SPDB | 2 |
AM4DP_3.SPDB | 2 |
AM4DP_2.SPDB | 2 |
As an alternative design for R, we could envisage a design that has the two relvars R1 and R2. Their predicates would be essentially the same as for R, but for the fact that they would also have to reflect the effects of the restriction to value 1 resp. 2 :
The simplified versions of the predicates clearly suggest that we can suffice with a unary relvar to represent this information. This is also obvious from looking at the nature of the restriction conditions applied, which leave no room for any 2V value other than the one in the restriction condition to appear in the result. (In normalisation theory, this would be epxressed as an FD {}->{2V} - the determinant for 2V is the empty set - which would suggest we'd have to at least split the R1/R2 relvars, moving their 2V attribute in its own table.) In our running example, that would yield the following spectacular relvars/relations :
THING_ID |
---|
BW_POSSIBLESPECIES.SPDB |
BW_SPOTS.SPDB |
BW_SPECIES.SPDB |
THING_ID |
---|
AM4DP_1.SPDB |
AM4DP_5.SPDB |
DATABASECATALOG.SPDB |
AM4DP_4.SPDB |
BW_SPOTSXSPNAME.SPDB |
AM4DP_3.SPDB |
AM4DP_2.SPDB |
The conclusion is that if a relation schema has an attribute 2V drawing its values from a 2-valued domain, then it is always possible to replace that design with an equivalent one that has two relation schemas (one for each value of the 2-valued domain), and where none of these schemas still have the attribute of the 2-valued domain. The equivalence between the two designs is as follows :
R1 ::= PROJECT ( RESTRICT ( R , 2V=v1 ) , (all but 2V) )
(likewise for R2)
R ::= UNION ( EXTEND ( R1 , 2V(v1) ) , EXTEND ( R2 , 2V(v2) ) )
(Aside : obviously, similar observations can be made if the domain has 3 values, or 4, or any other number. The property just seems more appealing and interesting to exploit in the case of 2-valued domains.)
Something similar can be observed for domains which are not two-valued, but for which an equivalence relation consisting of two equivalence classes is definable. To begin with a not so appealing example : the domain of integer values can be divided in two classes, the odd ones and the even ones. Or the powers of two and the non-powers of two.
FILENAME | ISPOW2 | PAGESIZE |
---|---|---|
AM4DP_3.SPDB | True | 32768 |
AM4DP_2.SPDB | True | 16384 |
AM4DP_1.SPDB | True | 8192 |
BW_SPOTS.SPDB | True | 16384 |
BW_POSSIBLESPECIES.SPDB | True | 16384 |
AM4DP_5.SPDB | True | 32768 |
DATABASECATALOG.SPDB | True | 32768 |
AM4DP_4.SPDB | True | 32768 |
BW_SPECIES.SPDB | True | 32768 |
BW_SPOTSXSPNAME.SPDB | False | 49152 |
The very same idea of "horizontal partitioning" illustrated before can be used to partition this relation :
FILENAME | PAGESIZE |
---|---|
AM4DP_5.SPDB | 32768 |
DATABASECATALOG.SPDB | 32768 |
AM4DP_4.SPDB | 32768 |
BW_SPECIES.SPDB | 32768 |
AM4DP_3.SPDB | 32768 |
AM4DP_2.SPDB | 16384 |
AM4DP_1.SPDB | 8192 |
BW_SPOTS.SPDB | 16384 |
BW_POSSIBLESPECIES.SPDB | 16384 |
FILENAME | PAGESIZE |
---|---|
BW_SPOTSXSPNAME.SPDB | 49152 |
The example is not so appealing because most likely, the domains of the 'pagesize' attribute in R1 and R2 will be the same. As a consequence, there are no operators/operations that we could or would want to invoke on the 'pagesize' attribute of R1/R2 that we could not invoke on the 'pagesize' attribute of the other. Expression writing is just as easy on R as it is on the R1/R2 alternative. We usually choose the R design, and not the split-up equivalent, because e.g. we can then avoid having to do separate physical design for two distinct relvars, and can suffice with a single one.
The story changes a bit if the domain of attributes like 'pagesize' are a tagged union, that includes special values such as 'UNKNOWN'.
transform(restrict ( extend (dbmsfile , ispow2( le(div(ln(tofloat(pagesize)),ln(float(2.0))),float(15.1)) )) , not(ispow2)),(filename,pgsize(string(UNKNOWN))))
PGSIZE | FILENAME |
---|---|
UNKNOWN | BW_SPOTSXSPNAME.SPDB |
PGSIZE | FILENAME |
---|---|
16384 | BW_SPOTS.SPDB |
16384 | BW_POSSIBLESPECIES.SPDB |
32768 | AM4DP_5.SPDB |
32768 | DATABASECATALOG.SPDB |
32768 | AM4DP_4.SPDB |
32768 | BW_SPECIES.SPDB |
32768 | AM4DP_3.SPDB |
8192 | AM4DP_1.SPDB |
16384 | AM4DP_2.SPDB |
The single-relation equivalent being :
FILENAME | PGSIZE |
---|---|
BW_SPOTS.SPDB | 16384 |
BW_POSSIBLESPECIES.SPDB | 16384 |
AM4DP_5.SPDB | 32768 |
DATABASECATALOG.SPDB | 32768 |
AM4DP_4.SPDB | 32768 |
BW_SPECIES.SPDB | 32768 |
BW_SPOTSXSPNAME.SPDB | UNKNOWN |
AM4DP_3.SPDB | 32768 |
AM4DP_1.SPDB | 8192 |
AM4DP_2.SPDB | 16384 |
Now, it looks like expression writing on R for doing computations on the values in the 'pagesize' attribute, gets more tedious, because we have to explicitly exclude the tuples with "UNKNOWN" from the relation we process.
Typically, that problem will be addressed by defining R1/R2 as views on R, such that the overall database definition has, not one, but THREE relvars. I believe that "keeping the number of relvars low/manageable", is exactly the opposite of what will be achieved with one-relvar-tagged-union-domains in practice, if we include the virtual relvars in that count. (Which is reasonable because they are intended for using thus the user must remember their names.)
And if you don't define those views, the user is left with all that pesky excluding-the-special-values all over the place.
It may well be the case that there is no significant difference between the options of single-relvar-with-tagged-unions-attributes and the split-up version, because in the former case the "split-up" relvars will most likely be defined as views anyway, and in the latter case the UNION will be defined as a view for presentation purposes anyway. And the database equivalence between the two options is mathematically provable.