Securing your SIRA_PRISE databases

Intro

I have been wanting to put these thoughts in writing for a long time. A recent discussion on the TTM list made me decide to finally take the time to do it.

What follows will concern itself with the use of 2-valued domains in the discipline of database design. The most obvious example of such a domain is, of course, the domain of truth values {true, false}, often named 'Boolean'. Many other domains can be imagined, however, that are also 2-valued, such as GENDER {male, female} or KNOWN_STATE {known unknown}. The running example will, for that reason, be using an abstract domain 2V {1, 2}.

A case of database design equivalence

Consider a relvar R with predicate "Thing §thing_id§ exists and its 2V property is §2V§". A possible value for this relvar might be :

R
THING_ID	2V
BW_POSSIBLESPECIES.SPDB	1
AM4DP_1.SPDB	2
BW_SPOTS.SPDB	1
AM4DP_5.SPDB	2
DATABASECATALOG.SPDB	2
BW_SPECIES.SPDB	1
AM4DP_4.SPDB	2
BW_SPOTSXSPNAME.SPDB	2
AM4DP_3.SPDB	2
AM4DP_2.SPDB	2

It will be clear that horizontal partitioning could be applied to this relvar, with one partition holding only the thing_id's for which 2V is 1, the other partition those for which 2V is 2. Given that domain 2V has only two values, we don't need more partitions to hold the complete original relation, and we can use the restriction operator to compute the two partition values from the value for R :

R1
THING_ID	2V
BW_POSSIBLESPECIES.SPDB	1
BW_SPOTS.SPDB	1
BW_SPECIES.SPDB	1

R2
THING_ID	2V
AM4DP_1.SPDB	2
AM4DP_5.SPDB	2
DATABASECATALOG.SPDB	2
AM4DP_4.SPDB	2
BW_SPOTSXSPNAME.SPDB	2
AM4DP_3.SPDB	2
AM4DP_2.SPDB	2

As an alternative design for R, we could envisage a design that has the two relvars R1 and R2. Their predicates would be essentially the same as for R, but for the fact that they would also have to reflect the effects of the restriction to value 1 resp. 2 :

R1 : "Thing §thing_id§ exists and its 2V property is §2V§ and §2V§ equals 1", which simplifies to "Thing §thing_id§ exists and its 2V property is 1"
R2 : "Thing §thing_id§ exists and its 2V property is §2V§ and §2V§ equals 2", which simplifies to "Thing §thing_id§ exists and its 2V property is 2"

The simplified versions of the predicates clearly suggest that we can suffice with a unary relvar to represent this information. This is also obvious from looking at the nature of the restriction conditions applied, which leave no room for any 2V value other than the one in the restriction condition to appear in the result. (In normalisation theory, this would be epxressed as an FD {}->{2V} - the determinant for 2V is the empty set - which would suggest we'd have to at least split the R1/R2 relvars, moving their 2V attribute in its own table.) In our running example, that would yield the following spectacular relvars/relations :

R1
THING_ID
BW_POSSIBLESPECIES.SPDB
BW_SPOTS.SPDB
BW_SPECIES.SPDB

R2
THING_ID
AM4DP_1.SPDB
AM4DP_5.SPDB
DATABASECATALOG.SPDB
AM4DP_4.SPDB
BW_SPOTSXSPNAME.SPDB
AM4DP_3.SPDB
AM4DP_2.SPDB

The conclusion is that if a relation schema has an attribute 2V drawing its values from a 2-valued domain, then it is always possible to replace that design with an equivalent one that has two relation schemas (one for each value of the 2-valued domain), and where none of these schemas still have the attribute of the 2-valued domain. The equivalence between the two designs is as follows :

Expressing R1/R2 in terms of R is done as R1 ::= PROJECT ( RESTRICT ( R , 2V=v1 ) , (all but 2V) ) (likewise for R2)
Expressing R in terms of R1/R2 is done as R ::= UNION ( EXTEND ( R1 , 2V(v1) ) , EXTEND ( R2 , 2V(v2) ) )

(Aside : obviously, similar observations can be made if the domain has 3 values, or 4, or any other number. The property just seems more appealing and interesting to exploit in the case of 2-valued domains.)

Variation on a theme

Something similar can be observed for domains which are not two-valued, but for which an equivalence relation consisting of two equivalence classes is definable. To begin with a not so appealing example : the domain of integer values can be divided in two classes, the odd ones and the even ones. Or the powers of two and the non-powers of two.

R :
FILENAME	ISPOW2	PAGESIZE
AM4DP_3.SPDB	True	32768
AM4DP_2.SPDB	True	16384
AM4DP_1.SPDB	True	8192
BW_SPOTS.SPDB	True	16384
BW_POSSIBLESPECIES.SPDB	True	16384
AM4DP_5.SPDB	True	32768
DATABASECATALOG.SPDB	True	32768
AM4DP_4.SPDB	True	32768
BW_SPECIES.SPDB	True	32768
BW_SPOTSXSPNAME.SPDB	False	49152

The very same idea of "horizontal partitioning" illustrated before can be used to partition this relation :

R1 :
FILENAME	PAGESIZE
AM4DP_5.SPDB	32768
DATABASECATALOG.SPDB	32768
AM4DP_4.SPDB	32768
BW_SPECIES.SPDB	32768
AM4DP_3.SPDB	32768
AM4DP_2.SPDB	16384
AM4DP_1.SPDB	8192
BW_SPOTS.SPDB	16384
BW_POSSIBLESPECIES.SPDB	16384

R2 :
FILENAME	PAGESIZE
BW_SPOTSXSPNAME.SPDB	49152

The example is not so appealing because most likely, the domains of the 'pagesize' attribute in R1 and R2 will be the same. As a consequence, there are no operators/operations that we could or would want to invoke on the 'pagesize' attribute of R1/R2 that we could not invoke on the 'pagesize' attribute of the other. Expression writing is just as easy on R as it is on the R1/R2 alternative. We usually choose the R design, and not the split-up equivalent, because e.g. we can then avoid having to do separate physical design for two distinct relvars, and can suffice with a single one.

The story changes a bit if the domain of attributes like 'pagesize' are a tagged union, that includes special values such as 'UNKNOWN'.

transform(restrict ( extend (dbmsfile , ispow2( le(div(ln(tofloat(pagesize)),ln(float(2.0))),float(15.1)) )) , not(ispow2)),(filename,pgsize(string(UNKNOWN))))

R2 :
PGSIZE	FILENAME
UNKNOWN	BW_SPOTSXSPNAME.SPDB

R1 :
PGSIZE	FILENAME
16384	BW_SPOTS.SPDB
16384	BW_POSSIBLESPECIES.SPDB
32768	AM4DP_5.SPDB
32768	DATABASECATALOG.SPDB
32768	AM4DP_4.SPDB
32768	BW_SPECIES.SPDB
32768	AM4DP_3.SPDB
8192	AM4DP_1.SPDB
16384	AM4DP_2.SPDB

The single-relation equivalent being :

R :
FILENAME	PGSIZE
BW_SPOTS.SPDB	16384
BW_POSSIBLESPECIES.SPDB	16384
AM4DP_5.SPDB	32768
DATABASECATALOG.SPDB	32768
AM4DP_4.SPDB	32768
BW_SPECIES.SPDB	32768
BW_SPOTSXSPNAME.SPDB	UNKNOWN
AM4DP_3.SPDB	32768
AM4DP_1.SPDB	8192
AM4DP_2.SPDB	16384

Now, it looks like expression writing on R for doing computations on the values in the 'pagesize' attribute, gets more tedious, because we have to explicitly exclude the tuples with "UNKNOWN" from the relation we process.

Typically, that problem will be addressed by defining R1/R2 as views on R, such that the overall database definition has, not one, but THREE relvars. I believe that "keeping the number of relvars low/manageable", is exactly the opposite of what will be achieved with one-relvar-tagged-union-domains in practice, if we include the virtual relvars in that count. (Which is reasonable because they are intended for using thus the user must remember their names.)

And if you don't define those views, the user is left with all that pesky excluding-the-special-values all over the place.

It may well be the case that there is no significant difference between the options of single-relvar-with-tagged-unions-attributes and the split-up version, because in the former case the "split-up" relvars will most likely be defined as views anyway, and in the latter case the UNION will be defined as a view for presentation purposes anyway. And the database equivalence between the two options is mathematically provable.

2-valued domains in database design.

Intro

A case of database design equivalence

Variation on a theme