This document essentially contains an introduction to the relational algebra that is implemented in SIRA_PRISE. Depending on your background, you may already have some or very extensive knowledge about the subject, in which case you probably do not need to read this document.
If, however, your knowledge of and about the Relational Model of Data was primarily/solely obtained through the usage of SQL-based systems, and terms such as "relvar", "relation" and "tuple" sound completely unfamiliar to you, then carefully studying this document is probably a sine qua non for you to understand and effectively use this new system you've got in your hands.
This introduction is subdivided in several parts :
You may have wondered, at some time, why "Relational databases" (and "Relational dbms's") are called how they are. You may have assumed that it must have had something to do with this thing called "Entity-Relationship modeling" that you may already have practiced at some point in your career. Your assumption was wrong. In fact, the first mention of the term "relational" in the context of data management even predates the invention of E/R modeling ... So where dóes the term come from ?
Well, the term comes from something that we have presumably all been taught about in our math classes, somewhere between the age of 10 and 20, that thing being the "relation". The Relational Model of Data (RM for short) is no more or less than a mechanism to represent data in general, and which has that same mathematical concept of a relation as its key building brick. Before digging into the details of the algebra, it is first necessary to define a couple of terms that are frequently used when talking about the RM. That is where we will start this introduction of the relational algebra.
Lots of people collect things. Stamps, coins, rare books, art reproductions or even original art masterpieces, just name it and someone exists who collects it. You probably have spent some time in your life too, trying to build up some collection. If you have, doubtless that there have been some items that you have managed to collect more than once. It was problaby fun trading those "duplicates" with fellow collectors for items that you had not yet managed to collect. Well, a "collection" is a very fundamental concept in mathematics, one that is in fact needed to define what a "set" is.
In mathematics, a "collection" is, loosely speaking, any possible gathering in any possible quantity of any possible things. The gathering of (chair, cloud, oak tree, chair) is a collection. Note that the collection has two chairs.
Now, if a collection has the particular property that none of the items in the collection appear more than once, then we call this collection a "set". This property turns out to be a very useful and important one, and therefore sets, and set theory, are very important in mathematics. So important in fact that relational database technology could and would not even have been invented without them. Important enough also, to be repeated in bold :
A
set is a collection that does not contain duplicates.
Something that does contain duplicates, is not a set.
So {A,1,oak tree,patent} is a set. {A,patent,A} is not. Now, in the context of data management problems, which is what we're interested in, sets such as those are of course not very useful. In the context of data management, sets are mostly useful only if the items they contain are in some way "alike". In that context, we usually deal with sets that contain only numbers, only dates, or some such. Useful sets in data management are sets such as {Monday, Tuesday, Wednesday, Thursday, Friday} and {true, false}. Without attempting a formal and precise definition of the term, we can (and do) say that a Type is such a "useful" set. And a Value is one of the items contained in a Type. Which brings us to :
A Type is a set of Values.
In data management, we will often need to speak about, or refer to, such Types. For that reason, it is customary that all the Types we use are assigned a name. The type {Monday, Tuesday, Wednesday, Thursday, Friday}, e.g. could be assigned the name 'Weekdays'. The type {true, false}, e.g. could be assigned the name 'Boolean'. Such that we can say :
A Type is a named set of Values.
You might wonder at the mutual dependency in the definition here. A type is a set of values, and a value is an item in a type. In the end, that doesn't really define much, does it ? Shouldn't we have some other definition of what a 'Value' is ? Well, for the purpose of this introduction, that isn't really necessary. The reason for this is that the RM (and the algebra) can work with just any set of types. The RM does not "fall apart" if the set of integers is not defined as a type, nor does the algebra. (It would be difficult to build a useful database without that type, but that's another issue.) And the most important consequence is that this observation is equally valid "the other way round" : the RM does not stop you from defining whatever type you like. The RM gives you no answer to the question when or why something is a value, and when or why not. As far as the RM is concerned, a value is whatever you want it to be, and a type is whatever set you define it to be. It only depends on whether you have a useful purpose for the type or not. A set such as {Em, C, G, ...} might look pretty insane and useless to most readers, but a guitar player would certainly have some good uses for a type holding the guitar chords as values ...
Looking at our example types Weekday and Boolean, it's not difficult to see how one can "pick a value from the first type, then pick a value from the second one, then arrange those two picked values to form a pair.". One example of such a pair would be (Monday, false). There are 10 such pairs possible, and those 10 pairs can be gathered into a set. Such a set turns out to be a very important concept in mathematics too, so important in fact that the concept of such a set is given a name : the "Cartesian Product".
The Cartesian product of two sets is the set of all possible pairs formed from the items in those two sets.
The term "Cartesian product" is not often encountered in the world of RM. The reason for this will become clear in the next few sections. Nevertheless it is worthwhile to reiterate in a more informal manner : A Cartesian product is a set, and the things that we find in that set are "pairs of other things", and those "other things" are precisely the things that are part of one of the sets that were used to form the Cartesian product.
Traditional set theory then goes on to define the concept of a "Relation" :
A Relation is a subset of a Cartesian product of two sets.
Examples of such relations are then : {(Monday, false),(Monday, true)}, {}, {(Tuesday, false),(Thursday, true)}, ...
Once again it is worthwhile to reiterate in a more informal manner : A Relation is a set, and the things that we find in that set are "pairs of other things", and those "other things" are precisely the things that are part of one of the sets that were used to form the Cartesian product of which the Relation happens to be a subset.
Following this definition, a relation can be (and is indeed) defined as being a set of pairs. For the context of data management problems, this definition isn't so useful, however. Indeed, when recording data about our customers, sales, ..., we usually record a lot more things than just two. So for the specific purpose of data management, the concept of a relation had to be "adapted" somewhat ("generalised" is the correct word, actually) to cater for that problem. It is not difficult to see how this can be done : instead of "picking values from two types and arrange those", one can also "pick one value from each of any number of types, and arrange those". The set that is formed by gathering all possible such combinations, is no longer called the "Cartesian" product, but rather the "generalized" product :
The generalized product of n types is the set of all possible combinations of values formed from those n types
The term "generalized product" is not often encountered in the RM world. It only serves as a step-up to the definition of what a Relation is in the RM world.
Once again it is worthwhile to reiterate in a more informal manner : A Generalized Product is a set, and the things that we find in that set are "combinations of other things", and those "other things" are, precisely, the Values that are part of one of the Types that were used to form the Generalized Product.
An important remark is to be made here. The generalized product is defined to be a set, specifically. And when defining a Type, it was said that a type is a set, and that the items that a type contains are called 'Values'. So the question could arise, if a generalized product is a set, and all it takes to be a type is to be a set of values, could we regard a generalized product as a type, and the items that the generalized product contains as values ? The answer to whether we can is obviously 'yes', but more important, is there any benefit in doing so ? This issue will be addressed when discussing a.o. the GROUP operator of the algebra, but it is worthwhile to remember already now :
Each generalized product can be considered as a Type in its own right
The concept of a Relation can now be generalized as follows (and it is this definition that is the cornerstone of the RM - and thus of the algebra too) :
A Relation is a subset of the generalized product of n types
Informally, this translates to "A Relation is a set, and the things that we find in that set are "combinations of other things", and those "other things" are, precisely, the Values that are part of one of the Types that were used to form the Generalized Product that the Relation happens to be a subset of.".
It is also important to
remember that, in the context of the Relational Model of Data, the word
"Relation" always means the kind of set defined here, and not to the
more "traditional" definition that we were/are usually taught in our
math classes.
The items found in a Relation (RM version) are then "pairs" having three, or twenty, or in fact any number of values. The name "pair" of course then becomes somewhat inappropriate to label such things, and of course, it is also desirable to just have a term, preferably other than "combination of n Values", to use whenever we want to refer to these things that form a Relation. That term is "Tuple".
A Tuple is one item of/in the generalized product of n types,
A Tuple is a combination of values drawn from n types
First and foremost, we have already established that a generalized product can be considered as a type in its own right. Implying that the things that such a generalized product contains, are themselves 'values'. Given that we have just baptised these things as 'Tuples', we can conclude :
A tuple is, itself too, a value.
Given that a generalized product contains only tuples as its members, and given that a Relation was defined to be a subset of a generalized product, the following is obviously true :
A Relation is a set of tuples (formed from n types)
By now, the question may rise, "How is all of this useful to the field of databases ?". The answer to this question is best illustrated by revisiting the previously given WeekDay/Boolean example. The "Generalized Product" as previously defined, when applied to the Weekday/Boolean example can be represented as follows :
Monday | true |
Monday | false |
Tuesday | true |
Tuesday | false |
Wednesday | true |
Wednesday | false |
Thursday | true |
Thursday | false |
Friday | true |
Friday | false |
One tuple of this "Generalized Product" can be represented as follows :
Wednesday | false |
And one Relation that is a subset of this "Generalized Product" can be represented as follows :
Monday | true |
Tuesday | true |
Wednesday | false |
Thursday | true |
Friday | true |
These representations may start ringing a bell. They all look like a table, and that's not coincidence. When doing your database work, you probably refer to these things as "row" and "table", which are the "synonyms" that were introduced by SQL (the language that you probably use -in some dialect- to manage your databases) for the terms "tuple" and "relation", respectively.
For the purposes of the present document (which is to clarify the algebra), it is not that terribly incorrect to think of a tuple as "being a row" and of a relation as "being a table". However, the algebra is defined in terms of tuples and relations, not in terms of rows and tables (they are not truly the same thing), so this document will consistently use the terms "tuple" and "relation".
Nevertheless, for a proper
understanding of other textbooks that attach more importance to using
formally precise terminology, it is indeed important for your attention
to be drawn to
In SQL, the "table" is undoubtedly a rather important concept. But what exactly does the SQL term "table" refer to ? You have just been presented three different things that are all three also tables, and only the third one of them would really correspond to the meaning that SQL attaches to the word. The second one would rather correspond the the SQL term "row", and the first one wouldn't really correspond to any SQL term at all ... This only goes to show how SQL's choice of terminology leads to a regrettable blurring between concepts that are quite distinct, as far as the theoretical foundations of the RM are concerned. The mathematical objects that have been presented here are really quite distinct, and in fact the only thing they have in common is that the same presentational technique (a table) might be used to visualise them. But that does not make the visualised mathematical concepts themselves, identical.
There is a very important consequence that derives from the way that the concept of a Relation is defined. A relation is a set, and a set is defined to be a collection that has no duplicates. See the consequence ? No tuple can appear twice in a relation. Or else it is no longer a relation.
SQL deviates from this principle, and you may wonder whether that is really so problematic. It is beyond the scope of this document to discuss the issue profoundly, suffice it to say that SIRA_PRISE faithfully follows the following rule (which stems from the way the RM is defined) :
No tuple can ever appear twice in a relation
It might have gone unnoticed, but another crucial thing is the 'n' in the definition. In the example of the Weekday/Booleans, does the generalized product contain any tuple of only one value ? Or can it possibly contain any tuple in which, e.g., the day is _not_ a weekday ? Obviously, the answer to both these questions is "no". And given that a relation is a subset of this generalized product, the same goes for any relation over those types.
SQL deviates from this principle by offering the possibility to include nulls in its rows. On this issue too, entire libraries have already been filled with the discussion of whether or not this is really a problem. Once again, it is beyond the scope of this document to add to that discussion, so once again it is only observed here (and important to remember) that SIRA_PRISE faithfully applies the rules of the model :
In
a
Relation, all tuples contain an equal number of values, and that number
of values is equal to the number of Types that the generalized product
was defined with
No tuple can contain any value that is not a value of the corresponding
Type
When we mentioned the definition of Cartesian product and the pairs of values that it consists of, we incorrectly spoke of the Cartesian product as being "the set of all such possible pairs". You may have noticed the error : the ordering of the values in the pair was overlooked. It was tacitly assumed that the values picked from the "first" type would have to occupy the first position in the pair, and the values picked from the "second" type would have to occupy the second position in the pair. Otherwise it would also have been possible, in the given example, to construct pairs such as (false, Friday).
So in the way that mathematics usually deals with relations and such, the order of the values is important and relevant. Implying that if we are given such a pair, and we need to "extract" a value from such a pair, then we need to know whether it is the first or the second value we're interested in.
Given that we're interested in data management, and that we will often be using relations and tuples with a significantly larger number of values, is that an interesting way to work ? Let's imagine we have some generalized product of 20 types (think of this as the declaration of a 20-column table), and all programs that operate on that table must refer to the values as "the sixteenth value", or "the thirteenth value" (for lack of another means to refer to those values precisely and unambiguously). Apart from the fact that it is mentally very tiring to write programs in that mode (causing way more programmer errors), consider what would need to be done if at one point in time we no longer needed the "third" value, and wanted to remove the corresponding type from the generalized product ("drop the column from the table"). All references in all programs to values beyond the third would have to be decreased by one ...
That's definitely not a good idea. Which is why the RM explicitly states that all values in a tuple must be associated with a name, and that that name should be the only way available to address that value.
You probably observed that the tables the previous example lacked something essential, and indeed you were right. What was lacking were precisely the names which are needed to establish "addressibility" of the values in the tuples. These names are called the attribute names. The table from the examples should actually have looked more like the following :
DAY | GOTOSCHOOL |
---|---|
Monday | true |
Tuesday | true |
Wednesday | false |
Thursday | true |
Friday | true |
Now, what does this prescription imply for the tuples that the algebra works with ? For the tuples and/or relations themselves, in fact it doesn't make so much of a difference. But it does for the way in which we represent those tuples and/or relations. When we have the attribute names, we can represent a tuple of this relation in a non-tabular way too : {DAY:Monday, GOTOSCHOOL:true}. And it wouldn't make a difference if we wrote {GOTOSCHOOL:true, DAY:Monday} instead.
Likewise, the ordering of neither the rows nor the columns in the tabular representation of a relation is material. The following relation is exactly the same as the previous one :
GOTOSCHOOL | DAY |
---|---|
true | Thursday |
true | Tuesday |
false | Wednesday |
true | Monday |
true | Friday |
In general, it can therefore be stated that :
The RM has no concept of ordering or ordinal position of attributes within tuples, no concept of ordering or ordinal position of attributes within relations, and no concept of ordering or ordinal position of tuples within relations.
Observe that the purpose of the attribute names implies that an attribute name cannot appear twice in a tuple. If an attribute name were allowed to appear twice in a tuple, then we would no longer be able to use that attribute name to uniquely identify the value of that attribute in a tuple. We can therefore also say :
A tuple is a set of named attribute values
Having put all this in place, we can make a number of important observations and define a few more useful terms wrt the RM.
We have already observed that in a relation, each tuple has an equal number of attribute values (and that number is equal to the number of types that were used to define the generalized product). That number is typically referred to as the degree of the relation and its tuples. SQL users may think of this as the "number of columns in a table".
Given that a relation is defined to be a set, it follows naturally that a number of concepts from set theory in mathematics apply equally well to relations. One such concept is the number of tuples that the relation has. In set theory, this number is called the cardinality of the set. The RM has retained the term and speaks of the number of tuples in a relation as that relation's cardinality. (Since we have established that a tuple is also a set, it is theoretically possible to speak of the cardinality of a tuple, too. However, since this concept has already been baptised "degree", that is usually not done.)
It will also be clear that in a relation, all tuples will not only hold an equal number of attribute values, but the set of attribute names used to refer to those values, will be the same no matter which tuple one looks at. It will furthermore be clear (even if it has not been stated explicitly) that the values in any two tuples for a certain attribute name, will be values from the same type, specifically. There will be no two tuples where one has a value from GregorianDate in attribute X, and the other has a value from some other type in attribute X.
It is thus possible to list all the attribute names of a relation, and next to each name put the name of the type from which the tuples draw their values for that attribute. In the example given previously, this list would look like : GOTOSCHOOL:Boolean, DAY:Weekday. This list is typically referred to as the heading of the relation and of the tuples.
Keys are, obviously, an important concept in relational databases. But as far as the purpose of this page is concerned, which is to explain the algebra, and to introduce the terminolgy that is needed to do so, the concept of keys bears no relevance.
In fact, it is, strictly speaking, even plain wrong to speak of "the key of a relation". In the RM, a key is something that applies to a relation variable, not to the values (the relations) such a variable possibly contains. A key is, in a sense, nothing more than a user-friendly facility that allows to define certain specific classes of database constraint. This chapter does not deal with constraints, and so there need be no mention here of the concept of 'keys'.
Time to start talking about the algebra. You may wonder at the idea of an 'algebra' with relations ... Wasn't 'algebra' about computing ? And isn't computing performed on numbers exclusively ? The answer is, obviously, 'no'. George Boole defined an algebra that allowed us to do computations on the logical truth values 'true' and 'false'. An algebra exists that allows us to do computations on matrices. The operations of union, intersection and difference as they are applied in set theory, form an algebra that is used to do computations on sets. And so on and so on ...
Then what does it take to 'be an algebra' ? How is it that the Relational Algebra (RA for short) is indeed an algebra ? Well, an algebra is essentially nothing but a set of computation operators. Ordinary arithmetic on numbers, e.g., is an algebra consisting of the operators addition, subtraction, multiplication, division, exponentiation. Set algebra includes the operators union, intersection and difference. Likewise, the RA is essentially just a set of operators that "compute" with relations as they have been defined in the introductory chapters.
And just as the addition and multiplication operators operate (in their most basic form) on two numbers to yield another number, so will an operator of the RA (commonly referred to as a 'relational operator') in its most basic form operate on two relations to yield another one.
You might think that the answer to this question is 'no'. After all, you've been using SQL for years already perhaps, and no one's ever mentioned the word 'algebra' in connection to SQL, right ? Well, it may be right that you've never heard anyone mention the word 'algebra' in connection with SQL. Nevertheless, the connection does indeed exist. The SQL language has, since day 1 of its existence, been intended as an implementation of the RA. Every query that you write is nothing more or less than an expression, a formula, of the RA. So if you've been using SQL, you've probably already written dozens of expressions of the algebra. You just weren't aware.
We mentioned earlier that, as far as the purpose of the present document is concerned, it is not terribly wrong to think of a relation as 'being a table'. At this point, however, we should clarify an important distinction that SQL mostly fails to make, and hence that SQL users mostly fail to recognize. When programming, you probably define variables that you subsequently use to store values in. You do so by "declaring" those variables with a certain name (e.g. int c;), after which you can assign a value to (i.e. store a value in) that same variable (e.g. c=1;). You usually also declare the type of values that the variable is allowed to store (integers in the example).
In much the same sense, SQL's tables can also be considered a "variable" of the programming system. Not one that is declared locally for some small program procedure, but rather one that is declared globally for every program that accesses the database. (The C programming language, and probably lots of others too, has the concept of 'static external' variables, which is somewhat similar.) As any other variable, it can be considered as "storing", or "containing" a value. That value being, precisely, the relation that consists of all the tuples that were "added" (and not yet deleted back again). A variable that is declared to contain integer values specifically, is usually referred to as an "integer variable". By analogy, the RM uses the term "relation variable" for a variable that is declared to contain relation values specifically. RM literature often abbreviates that term to "relvar".
SQL blurs this distinction rather seriously by using the word 'table' for both of these concepts (the word stands both for the variable and for the -relation- value it contains).
When discussing the operators of the RA, it is important to remember that they always deal with relation _values_ specifically, without any consideration as to where and/or how that relation value originated. In particular, an operator of the RA does not prescribe that its input can only be a "table" (in the 'variable' sense of the word). Quite the contrary : it is perfectly possible for the input of a relational operator to be some other expression of the RA. The input to a projection can be the result of a restrict, and the input to that restrict can be the result of a join, etc. etc. In other words : expressions of the RA can have any arbitrary level of nesting, just so long as those expressions themselves evaluate to relation values, i.e. yield a relation value.
The examples in the discussion of the relational operators below, will adhere closely to the syntax used by SIRA_PRISE. For purposes of readability however, certain expressions in the examples will not be given using the syntactic notation used by SIRA_PRISE. This is mainly for purposes of readability, and for maintaining a higher degree of "visual familiarity" towards database users who are used to SQL expressions.
The operators that we will be introducing here, are the following : restriction, projection, extension, renaming, intersection, union, difference, join, division, semijoin, semidifference, transitive closure, grouping, ungrouping, aggregation and summarization.
Note that the operators' "definitions" as we will present them here are not always "formal" definitions. The only goal we try to achieve is to explain the basic operation that the operator stands for, which can often be achieved equally well by explaining the basic algorithm it performs.
Recall the relation that was presented in tabular form earlier on :
DAY | GOTOSCHOOL |
---|---|
Monday | true |
Tuesday | true |
Wednesday | false |
Thursday | true |
Friday | true |
Often we will want to inspect a relation value and "retain" only those tuples of the inspected relation that satisfy certain conditions. The restriction operator is the operator that can perform that task for us. The restrict operator takes a relation as input, plus a boolean expression, and yields a relation consisting of all and exactly those tuples of the input relation for which the boolean expression evaluates to 'true'.
Assuming our relation is
the current value of the relvar named 'R', then 'RESTRICT(R
,
GOTOSCHOOL = false)
' will
produce the relation
DAY | GOTOSCHOOL |
---|---|
Wednesday | false |
Observe that just any boolean expression is allowed as the restrict condition, and that there is no obligation for the expression defining this restrict condition to reference any attribute of the relation that is restricted. So the following expressions are all valid :
RESTRICT(R
,
false)
RESTRICT(R
, true)
RESTRICT(R
, 1 = 1)
RESTRICT(R
, 1 = 2)
RESTRICT(R
, GOTOSCHOOL)
RESTRICT(R
, NOT(GOTOSCHOOL))
Another thing that may be less obvious : the boolean expression can itself be of arbitrary complexity, and can also contain references to other relation variables ! It is, e.g. perfectly possible to check whether an integer attribute in some tuple contains a value that is equal to the cardinality of some other relvar, as in 'X where score = cardinality(Y)', where score is an attribute of X, Y is some other database relvar, and cardinality is an operator that returns the number of tuples in (the relation that) its argument (evaluates to).
In information processing, it is very unlikely that all programs will always need to inspect the values of all attributes in a relation. Often, databases also contain 'comment' fields and 'label' fields which only serve a descriptive purpose, and which often aren't needed by the programs that do the "core business" of the organisation.
So it might be interesting to be able to build, starting from a given relation, another relation that has only the attributes we really need. This is precisely what the projection operator can do for us. The projection operator takes a relation as one argument, and a set of attributenames as the other (where obviously, that set can only contain names of attributes that effectively appear in the relation). It then produces a relation that holds one tuple for each combination of values that appears in the input relation for the given list of attributes (that tuple being precisely that combination of values).
The
projection of the example relation on the DAY attribute ( PROJECT(R
, (DAY))
) gives the
following result :
DAY |
---|
Monday |
Tuesday |
Wednesday |
Thursday |
Friday |
The
projection of the example relation on the GOTOSCHOOL attribute
( PROJECT(R ,
(GOTOSCHOOL))
) gives
the
following result :
GOTOSCHOOL |
---|
true |
false |
No doubt this result looks strange to you. But take a close look at the way the definition for the result of the projection operator is formulated. The result of a projection contains one tuple "for each combination of values found for the projection attributes in the input relation". Only two such combinations exist : one that holds the boolean value 'true' and one that holds the boolean value 'false'. It does not matter how many times this "combination of values" appears in the input, only one can appear in the result of the projection. Remember that the projection operator is supposed to return a relation, that a relation was defined to be a set of tuples, and that therefore, the result of a projection cannot possibly contain the same tuple (as you might have expected in this example) 4 times. Otherwise, the result of the projection would not be a relation anymore (at least not guaranteeably in all cases). It is important to remember that the projection operator of the algebra always "eliminates duplicates" in its result.
Another special case is the
projection on no attributes at all ( PROJECT(R
, ())
) , which is also a
valid projection. If the input to such a projection is non-empty, then
the result is a relation of degree zero with one tuple (the tuple of
degree zero). One could try and visualise this as a table with one row
and no columns. If the input to such a projection is empty, then the
result of the projection is the empty relation of degree zero (which
one could try to visualise as a table with no rows and no columns).
Sometimes, it is interesting to be able to "derive" values for additional attributes in a relation, based on the attributes that are already there. Application programs may benefit if the dbms can handle these kinds of "derivation" for them, rather than having to do it themselves. This is precisely what the extension operator can do. The extension operator takes a relation as one argument, and a set of extension specifications as its second argument. An extension specification consists of both an attribute name and an expression that will determine the values of the additional attribute for each tuple of the relation argument : for each tuple in the relation argument, every expression in the set of extension specifications is evaluated. For each such expression, the value obtained is "appended" to the tuple as the value of the attribute the expression corresponds to.
Here is an example of an extension that computes a value for an AMIHAPPY attribute, based on the value that appears in each tuple for the GOTOSCHOOL attribute :
EXTEND(R
, AMIHAPPY(NOT(GOTOSCHOOL)))
DAY | GOTOSCHOOL | AMIHAPPY |
---|---|---|
Monday | true | false |
Tuesday | true | false |
Wednesday | false | true |
Thursday | true | false |
Friday | true | false |
A few rather obvious rules apply to the extension specifications : it is obviously illegal to specify an attribute name that already exists in the input relation (otherwise the result would contain two attributes of the same name, which is against the rules that the RA imposes on its relations), and it is equally illegal to specify an attribute name twice in the set of extension specifications, for the same reason.
Like with restrict, just any expression can be specified, just as long as it does not contain references that the system is unable to resolve. In particular, literals are also valid expressions in an extend specification, as are expressions that somehow refer to other database relvars.
EXTEND(R
, LETTERS(LENGTH(DAY) - 3))
DAY | GOTOSCHOOL | LETTERS |
---|---|---|
Monday | true | 3 |
Tuesday | true | 4 |
Wednesday | false | 6 |
Thursday | true | 5 |
Friday | true | 3 |
Care must be taken, however, to avoid the emerging of runtime exceptions, such as division by zero. If an extend expression involves an operator invocation that potentially gives rise to such exceptions, it is the programmers responsibility to take all necessary measures to prevent/avoid them.
Attribute renaming is an operator that is very similar to projection and extension, and can in fact even be considered as a (very particular) combination of both. The prime reason why it exists as a separate operator nevertheless, is that it is rather frequently needed in combination with the join operator (which will become clear when we discuss that operator).
Its operation should be clear simply by its name : it does nothing than produce a relation value that has exactly the same set of tuples (at least as far as the attribute values they hold are concerned), but where some of those attribute values now appear under another attribute name.
Formally, that is realised by giving as the operator's second argument a set of rename specifications. Each rename specification consists of two attribute names, the first one of which must refer to an existing attribute in the input relation, and the second one of which defines the name under which the concerned attribute values will appear in the result of the operator's invocation.
An example to rename the DAY attribute to
( RENAME(R
, (DAY,DAYOFWEEK))
)
:
DAYOFWEEK | GOTOSCHOOL |
---|---|
Monday | true |
Tuesday | true |
Wednesday | false |
Thursday | true |
Friday | true |
The union operator is intuitively easy to understand, because of its similarity to the union operator of classical set algebra. The union operator of the RA takes two relations as argument, and yields a relation that contains all and exactly those tuples that appear in either of the two input relations. Observe that of course once again, the relational requirement of "no duplicate tuples" holds. Observe too that the heading of both arguments must be identical, for the result to be a valid relation. This means that the relation arguments of the union operator must have the same number of attributes and the same set of attribute names, and for each pair of equally-named attributes, the type of that attribute must also be the same. Relations for which this condition holds are generally said to be "union-compatible".
The relational intersection operator closely resembles its counterpart of classical set algebra. It takes two relations as input value, and yields a relation containing all and exactly those tuples that appear in both arguments. Observe that this implies that both arguments must be union-compatible.
As is the case with union and intersection, so does the relational difference operator resemble very closely its counterpart in classical set algebra. As can be expected, the relational difference operator takes two relation values as argument, and yields a relation containing all and exactly those tuples which do appear in the first argument (the minuend), but do not appear in the second (the subtrahend).
The join operator implemented by SIRA_PRISE is the natural join operator. This operator is defined as follows :
If one relation has heading {a, b, c, ..., m, n, ...}, and another relation has heading {m, n, ..., v, w, x, ...}, then the natural join of these relations is a relation with heading {a, b, c, ..., m, n, ..., v, w, x, ...}, i.e. all the attributes that appear in the heading of either relation also appear in the heading of the join (but in line with the prescriptions of the RA, they do so only once).
A tuple {a1, b1, c1, ... m1, n1, ..., v1, w1, x1, ...} will appear in the resulting relation if (and only if) : (A) a tuple {a1, b1, c1, ... m1, n1} appears in the one argument of the join, and (B) a tuple {m1, n1, ..., v1, w1, x1, ...} appears in the other argument of the join.
Observe that this definition holds equally well for the case where the set of attributes {a, b, c, ...} is actually empty (argument 1 has all attributes in common with argument2), the case where the set of attributes {m, n, ...} is actually empty (arguments have no attribute in common), and the case where the set of attributes {v, w, x, ...} is actually empty (argument 2 has all attributes in common with argument 1).
The set of attributes {m, n, ...} is commonly referred to as the set of "join attributes". It follows from this definition that the set of join attributes is determined by (and only by) the heading of its arguments. Furthermore, the condition for "joining" two tuples is that all values in those two tuples appearing for a join attribute, must be equal. Natural join does not allow to specify that the "join condition" is satisfied if join attributes are unequal.
If you're used to using SQL, those two properties may seem like a rigid restriction, but they can be easily overcome, and both by using the RENAME operator mentioned earlier. Renaming a join attribute in only one argument has the effect of "moving the attribute out of the set of join attributes", which already solves the first problem.
As for the second problem, assuming that we wanted a join condition to be something similar to 'A.m1 > B.m1', we can simply rename the m1 attribute in one of both join arguments to move it "outside the set of join attributes", and then apply the greater-than condition in a restrict applied to the result of the join. This need not imply that the join is first computed completely before the restrict condition is applied.
The following examples illustrate the characteristics of the natural join operator by showing the result when our example relation is joined to the relation mentioned in the example :
R joined to ... | Result | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
|
|
Occasionally, businesses are interested in things like "the list of customers who already have all possible types of insurance policy", or "employees who have followed all training programs they were eligible for", or some such. Ever wondered how you would write such a query in SQL, without using any program code ? To get to the solution, a little rephrasing of the stated problem (e.g. the cusomters with all types of insurance policy) will be helpful : "the list of customers for which no insurance policy type can be found such that that customer does not have a policy of that particular type". The SQL for this might look something like :
SELECT ... FROM CUSTOMERS WHERE NOT EXISTS (SELECT ... FROM POLICYTYPE WHERE NOT EXISTS (SELECT ... FROM POLICY WHERE POLICY.CUSTNO = CUSTOMERS.CUSTNO AND POLICY.TYPE = POLICYTYPE.TYPE))
Queries such as these constitute an invocation of the relational division operator. This example sufficiently reveals that the relational division is a rather complex operator. Furthermore, it is an operator that isn't needed all that often, and is therefore usually not easily understood. The complexity of its semantics are also revealed by the fact that it needs three relation arguments to work on (inspect the SQL version closely to verify this).
In order to explain the operator as intuitively as possible, we name the three relations that are an argument to the division the subject (customers in the SQL example), the intersection (policies in the SQL example) and the target (policy types in the SQL example).
Essentially, what the operator does, is to process the subject relation on a tuple-per-tuple basis. For each such tuple (e.g. each customer), it looks up (in the intersection relation) a "complete" set of information relating to that subject tuple (e.g. the "complete" set of all policy types that this customer has subscribed), and then checks if that "complete" set of information is a superset of the target (e.g. all possible policy types), i.e. whether all entries in the target can also be found in the "complete set of information" from the intersection relation. If it is, then the subject tuple is included in the result of the division, otherwise it is not.
Given that the operator needs to deal with "sets of information", we will later see that there are better ways to deal with this kind of problem.
You might wonder why we want the check against the target to be a superset test, and not an equality test. After all, surely there can't be policies of a policy type that doesn't exist, can there ? Possibly true, but the target might also be "all policy types that already have been commercialised by our company for more than 20 years". In which case there could definitely be a policy type subscribed by a customer that does not appear in the target relation of the division.
There actually are quite a lot of subtleties such as this one related to the definition of the division operator. Others are to do with restrictions that are either or not imposed on the heading of the three arguments.
Which is yet another reason why the operator is rarely talked of. In fact, the operator is mentioned here more for completeness' sake and to draw the reader's attention to the class of problems it was aimed to solve. SIRA_PRISE currently doesn't support the operator.
In our discussion of join, we have seen that the heading of the result is the union of all attributes that appear in either one of the join arguments. Quite often though, this is not exactly what we need. Quite often we are only interested in knowing for which tuples of one relation, a "corresponding" tuple can be found in another relation, without needing to know the actual contents of that corresponding tuple.
This is why the semijoin operator is usually defined as a separate operator, and indeed supported by SIRA_PRISE. The result of the semijoin of two arguments is the same as if the join of the two arguments were computed, and that result projected back again to just the attributes of the first argument of the join.
The following table repeats the examples already given with join, but now displays the results for a semijoin of the arguments :
R semijoined to ... | Result | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
||||||||||||||||||||
|
|
||||||||||||||||||||
|
|
||||||||||||||||||||
|
|
||||||||||||||||||||
|
|
Just as we often find ourselves confronted with the need to know for which tuples of a relation there exists a corresponding tuple in another relation, so do we find ourselves often confronted with the need to know for which tuples of a relation there does not exist a corresponding tuple in some other relation. This is precisely what the semidifference operator does.
The result of the semidifference operator is equal to the (relational) difference of the first argument and the semijoin of the two arguments.
Have you ever needed to process a bill-of-materials structure, trying to find out, e.g. all the parts in which a certain given part is used, either directly or indirectly ? Then you probably know the problems you get confronted with trying to solve this with SQL. There's no problem finding out in which parts y a certain given part x is used directly. There's no problem finding out all the parts z in which any part y is used in which the given part x is used. There's no problem finding out all the parts x in which any part z is used in which any part y is used in which the given part x is used. There's no problem computing the union of all of these. The problem is where to stop.
In the early days, you either had to choose some upper limit, but then if the real number of "levels of containment" in the database exceeded that limit, your query produced incorrect results. Or else you had to code an iteration process yourself until you no longer found any part in which any part was used in which any part was used in which any part was used in which part x was used ...
Later, SQL came up with the possibility of writing recursive queries, relieving the programmers of having to write the iteration process themselves. But I think most of you who have already written recursive queries will agree that writing these is still rather unintuïtive and error-prone. At least, that's my personal opinion.
The operator of the RA that is "underneath" all of this, is called transitive closure. Its operation is best explained with an example, using the following relation with two attributes :
ATTR_A |
ATTR_B |
A |
B |
B |
C |
D |
C |
C |
G |
First, the intermediate result is set to this relation value.
Then, all possible pairs of tuples from this relation are compared, and if ATTR_B in the first tuple is equal to ATTR_A in the second, then a tuple is added to the result consisting of ATTR_A of the first tuple and ATTR_B of the second (e.g. a tuple (A,C)). This gives the following intermediate result :
ATTR_A |
ATTR_B |
A |
B |
B |
C |
D |
C |
C |
G |
A |
C |
B |
G |
D |
G |
This process is then iteratively repeated until no more tuples have been added to the intermediate result. In the example here, only one more iteration would be needed to add the tuple (A,G).
Time to revert to the bit of theory from the introductory chapters. In section 2.5.2, it was already established that a Generalized product, and thus a relation too, can be considered as a type in its own right. There is an important consequence : we can use that type to define a relation heading and declare a certain attribute in that heading to be of this type ! The value that such an attribute will take on, will be a tuple value specifically (remember that a generalized product/relation is a set of tuples, so if the relation is the type, the values drawn from it must be tuples).
However, this theoretical possibility is actually not all that interesting. There is little (actually nothing) that one can represent using a single tuple that one cannot represent using the needed number of separate attributes. It is far more interesting to have the possibility to have attributes inside a tuple that take on a relation value. Indeed, there is one thing that a relation value can represent and that other representations cannot handle, and that is the empty relation (or empty set).
But if we want that possibility, what type do we draw those values from ? We haven't encountered any kind of set yet that had relations as its members. We do need such a set if we want relations to be drawn as values from a type. That set is not so difficult to define : given a generalized product, we can "construct" the set of all possible subsets of that generalized product. Remember that since we named such a subset a relation, we have actually defined a set of all possible relations of a certain heading.
Relations are, themselves, values too, and relation attributes can therefore be declared to be of another relation type. Such attributes are called 'Relation-valued attributes' (RVA's for short).
In the RA, two operators are available that allow us to manipulate relations in connection with RVA's : GROUP and UNGROUP. The former essentially serves the purpose of "creating" relations that have a relation-valued attribute, and UNGROUP essentially does the opposite.
GROUP takes as its first argument a relation value, and as its second argument a "grouping specification". This "grouping specification" consists of an attribute name for the relation that the GROUP operator will produce, and a set of attribute names that appear in the input relation.
GROUP will produce a relation of degree n+m, where n is the number of attributes that have not been mentioned in any grouping specification, and m is the number of grouping specifications. Observe that n can be zero. The n attributes that have not been mentioned in any grouping specification will also appear in the heading of the relation produced by GROUP. The other m attributes in the result of GROUP will have the name specified in the grouping specification, and be relation-typed (i.e. those will be RVA's in particular). For each of these m relation-typed attributes, the heading of their relation value will consist of all and exactly those attributes from the (heading of the) input relation that were specified in the grouping specification for that relation-typed attribute.
If we take the example relation from the TCLOSE operator (the relation produced as the result in that example), then a group specification might be : DESCENDENTS(ATTR_B). This will produce a relation with two attributes : ATTR_A and DESCENDENTS. The former has the same type as in the input relation, the latter is relation-typed. The heading of the relations held as values in the DESCENDENTS attribute, consists of the attribute ATTR_B, which has the same type as the input relation.
Grouping that relation with this grouping specification, will yield the relation :
ATTR_A |
DESCENDENTS |
A |
ATTR_B B C G |
B |
ATTR_B C G |
D |
ATTR_B C G |
C |
ATTR_B G |
Without attempting a complete formal definition, it should be intuitively clear from this example how the value of the DESCENDENTS attribute is computed.
You might suspect a similarity here with SQL's GROUP BY ... construct here. That is not quite correct. The SQL construct corresponds more to the SUMMARIZE operator, which will be explained later. "Common" SQL actually does not have a counterpart to the GROUP operator, because it has no way to expose a table as an attribute value to the user. As far as the link to SQL isimportant, the result of the GROUP operator of the RA is actually to be regarded more as an "intermediate result" in the computation of an SQL GROUP BY ... result.
An important point to make about the GROUP operator is that it can provide a substitute for the DIVIDEBY operator discussed earlier. Setting up an expression using the GROUP operator that produces the same result as some invocation of DIVIDEBY is still somewhat tricky and elaborate, but it is certainly doable.
It will now also be intuitively clear what the purpose is of UNGROUP : that is precisely to "revert" relations such as those produced by GROUP, to a relation that does not have any (or certainly less) relation-typed attributes.
There are certain relation values that GROUP cannot produce, but for which UNGROUP nevertheless has well-defined results. In particular, GROUP can never produce a relation value that has a relation-valued attribute for which the empty relation appears as a value in the result of the GROUP invocation. UNGROUP must, however, be able to deal with such relations. In particular, it will do so by not including a corresponding tuple in its result.
The aggregation operator is not commonly defined as a separate operator of the RA. It is specific to SIRA_PRISE, and the reason is that, despite the fact that it is basically just a building block on which the summarization operator builds, it may nevertheless in certain specific cases be useful to expose aggregation as a separately existing operator to the user. Summarization also becomes easier to explain once aggregation has been defined.
The aggregation operator takes one relation as an input argument, and a set of aggregation specifications as its second argument. Each aggregation specification consists of an attributename for the relation produced by the operator, the name of an associative operator that will produce the aggregated attribute value in the resulting relation, and an expression defining the value to be aggregated.
The aggregation operator produces its result by proceeding as follows :
First, for each aggregation spec an intermediate result is initialized to the identity value of the associative operator named in the aggregation spec.
Then, the entire input relation is processed on a tuple-per-tuple basis, and for each tuple :
The value of each expression in an aggregation spec is evaluated for that tuple, and that value (v) is then "aggregated" into the intermediate result by invoking the associative operator named in the aggregation spec with as arguments both the intermediate result and the obtained value v.
When all tuples of the input relation are processed, a singleton relation is produced with one attribute per aggregation spec, where the value for each attribute is the last "intermediate" result of that aggregation spec.
E.g. an aggregation spec COUNT(PLUS(INT(1))), will evaluate the literal value 1 for each tuple, invoke the PLUS operator for each tuple, and produce a tuple holding an attribute COUNT whose value is equal to the last "intermediate result" obtained (which will thus be equal to the number of tuples in the input relation).
Likewise, an aggregation spec COUNT(PLUS(ATTR)), will evaluate the attribute ATTR for each tuple and sum those.
Summarization is basically the operator underlying SQL's GROUP BY construct. In terms of the RA, it can be defined as a combination of invocations of GROUP and AGGREGATE.
Summarization takes three arguments : a relation R to be summarized, a set of attributenames A defining which "groups of tuples" of R will be summarized together, and a set of summarization specs S which are completely identical in nature to the aggregation specs presented in the foregoing section.
Summarization essentially obtains its result by performing the following steps :
First, GROUP R such that all attributes mentioned in A are retained, and all the other attributes of R become a member of a single relation-valued attribute RVA.
Then, process the result of that grouping on a tuple-per-tuple basis, where for each tuple, the relation value of its RVA attribute is aggregated using the given summarization / aggregation specs, and a resulting tuple is produced with all attributes in A, plus the RVA attribute, whose value will be the singleton relation obtained from the aggregation.
Then, that relation is UNGROUPED back again. The result will thus be a relation whose heading has all attributes mentioned in A, plus one attribute per summarization spec. The values for these latter attributes will be values that have been computed by the aggregation mechanism.
As an example, we give the result that would be obtained if a summary was made of the resulting relation of the TCLOSE example, giving only ATTR_A as grouping attribute and COUNT(PLUS(INT(1))) as summarization spec :
ATTR_A |
COUNT |
A |
3 |
B |
2 |
D |
2 |
C |
1 |
The relational operators union, intersect and join have been defined here in such a way that they accept exactly two arguments. However, it happens to be the case that these operators have the mathematical property of being associative. SIRA_PRISE therefore allows invocations of these operators to have more than two arguments.
The theoretical possibility of invocations of these operators with only one argument, or with no arguments at all, is syntactically not supported by SIRA_PRISE. If these operators are invoked in the context of an aggregation or summary, then SIRA_PRISE will compute the results correctly.
It's quite likely that you have already run into the need for your databases to keep track of historical data. It's an old problem, and existing technology does not really offer a comfortable or easy way for programmers to deal with it. Note very carefully that I haven't said "the Relational Model does not offer ...". That is due to the fact that the relational model does offer an adequate solution for the problem, it's just that the currently existing implementations of the model don't offer an adequate solution.
If you have had to design a database that kept track of historical data, then your solution probably was to "add a startdate to the table, and include that in the key". You may or may not have included an enddate in the table too, and depending on whether or not you did, found yourself confronted to either one of the following two problems for which no easyly written and easily maintainable solution exists :
If you do include an enddate, how will you prevent some row in your table from "overlapping" with another row (and possibly even contradict with that other row).
If you do not include an enddate, what is the query that will give you the applicable enddate for a certain row, if requirements are such that both the start date and the end date are needed ?
In general, you will probably have found that writing queries becomes a very tedious and error-prone process, right up to the point where writing correct queries actually becomes a real nightmare.
There does, however, exist a much better solution to the problem. That solution is explained thoroughly in the book "Temporal Data and the Relational Model". This book is strongly recommended reading for any programmer who has to deal with historical data. The solution is based on the usage of what is called "interval types".
SIRA_PRISE implements a solution based on interval types that is inspired by said book. There does exist a significant difference in approach between the book and SIRA_PRISE, but that difference does not show up in the results that the relational operators produce. So, as far as the relational operators (as implemented in SIRA_PRISE) themselves are concerned, I believe they behave identically to the behaviour depicted in the book. Which is what matters.
The word 'interval' may vaguely ring a bell. You may remember having been taught about them in your math classes, and maybe wondered what good it all was. Well, the 'interval type' used here is exactly the same kind of thing.
Consider two attributes in a relvar called 'STARTDATE' and 'ENDDATE' respectively, both declared to be of type DATE. The rest of the attributes contain the "actual" information, e.g. :
NAME |
DISCIPLINE |
STARTDATE |
ENDDATE |
Bob Beamon |
Long Jump |
1968-07-24 |
1998-06-18 |
You can probably guess what this row intends to declare : Bob Beamon held the world record Long Jump from 1968-07-24 to 1998-06-18. Now, what "meaning" can be ascribed to these dates, what "things" do these dates tell us ? Two are rather obvious : one being that way precisely on july, 24 of 1968, Bob Beamon acquired the world record, and another one being that precisely on june, 18 of 1998, he lost the world record to someone else who did better. But there's a third thing they tell us, and that's about all the dates that are "between" these two dates : we can say Bob Beamon was the world record holder on all these dates in between, too.
In that sense, the two dates are somehow "closely related", or it might even be said that in that sense, they actually "belong together". Expressing that "belonging together" is precisely what is done by representing these dates as an interval :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Note that this table now has only three columns, representing the fact that there now are only three attributes in the relation. Note also that this 'DURING' attribute is not an attribute of type DATE (indeed it does not contain a valid date), but rather of type 'INTERVAL of DATE'.
Informally, one can think of this interval as "representing a set of dates", where this set has the particular property that there does not exist a date that is "in between" the start and end dates, and where that date is not a member of the set. In other words, the set represented by this interval is "contiguous", meaning that all the dates it contains form a contiguous series.
One might think that, by replacing the two DATE attributes with one single INTERVAL value, we have lost the possibility to obtain the actual date when Bob Beamon acquired his world record. Indeed, there no longer is an actual date value in the relation. Obviously, this is undesirable, and this must be catered for. That is done by defining a set of operators to deal with values of an interval type which allow us to manipulate these interval values. The most important of them are : STARTOF, ENDOF, UNION, INTERSECT, and MINUS. Another important class of operators in connection with interval types are known as "Allen's operators", after the man who first proposed them. They allow to verify whether two interval values are adjacent, overlapping, disjoint, contained within one another, etc. etc.
STARTOF will "retrieve" the start date from an interval value, ENDOF will "retrieve" the end date from such an interval value. The other three are the obvious equivalents of the like-named operators of the set algebra (this makes sense since an interval can indeed be thought of as being a set). Obviously, the interval versions of these operators must ensure that the value they produce is indeed still an interval (e.g. the set union of two "contiguous" sets is not necessarily itself a "contiguous" set).
The information contained in the 'world-records' example can be represented in a multitude of ways, of which only the simplest was given. Another relation representing exactly the same information is, e.g., the following :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1995-03-10 |
Bob Beamon |
Long Jump |
1992-01-12 - 1998-06-18 |
Yet another one might be, e.g. :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1995-03-10 |
Bob Beamon |
Long Jump |
1995-03-10 - 1998-06-18 |
In the first example, the two tuples both ascribe the world record for the same discipline to the same person, for two different periods, but those periods overlap. In the second example, once again the world record for the same discipline is ascribed to the same person for two different periods, and allthough they no longer overlap, it is still the case that the two periods together form one single, contiguous, span of time.
In the first example, that causes a form of reduncancy to appear in the database : for the period from 1992-01-12 to 1995-03-10, there are two distinct tuples each saying who held the world record in that particular period. If someone later discovered that this information is actually incorrect, then two distinct tuples would have to be corrected. All such forms of reduncancy are certainly undesirable. The second example does not suffer from this kind of redundancy, but nevertheless it is still a "more complicated" way of saying the very same thing compared to what is also said by the single-tuple example at the beginning of this section. More complexity without additional value, is also undesirable.
So it is therefore, in general, desirable that the interval values in the database are "as large as possible", without losing or corrupting any information. If a relation heading has at most one interval-typed attribute, then there is exactly one relation that satisfies that criterion. That relation is said to be in "packed form". Relations in packed form have the desirable properties that there are no points within the intervals for which the same logical proposition is implied by two distinct tuples, and that there are no two distinct tuples that can be "merged into one" whilst maintaining the same "information set".
Since these properties are desirable, it is also desirable that the relational operators always produce relations that are in packed form. The following sections will explain how this affects each of the relational operators.
Restriction is only affected by the existence of interval types, by the set of operators that can be invoked in its boolean expression. If a relation heading has an interval-typed attribute, then it is only obvious that it should become possible to invoke, e.g., the OVERLAPS() operator using that attribute as one argument and using some other interval value as the second.
Restriction cannot produce a non-packed relation if its input relation is itself packed.
The same obviously does not hold for projection. Assuming the world-records relation was the following value :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Mike Powell |
Long Jump |
1998-06-18 - |
Then a projection of this relation over attributes DISCIPLINE and DURING could yield :
DISCIPLINE |
DURING |
Long Jump |
1968-07-24 - 1998-06-18 |
Long Jump |
1998-06-18 - |
This relation is not in packed form. The projection operator will therefore not produce this relation value as its result, but rather the following :
DISCIPLINE |
DURING |
Long Jump |
1968-07-24 - |
These operators are not affected by the possible appearance of interval values in the attributes of the relations they operate on, except then (in the case of extend) in the set of operators that can be invoked by the extend expressions.
The union operator can obviously cause an overlapping (or adjacency) of tuples in its output, if no special measures are taken. SIRA_PRISE's implementation of the operator does take these special measures.
Supposing that the world-record history for each discipline is kept in distinct relations, we might have the following relations (for, say, long-jump and 100m respectively) :
NAME |
DURING |
Bob Beamon |
1968-07-24 - 1998-06-18 |
NAME |
DURING |
Bob Beamon |
1968-02-24 - 1969-05-05 |
It is obvious that queries of the nature "Which athletes have been the world-record holder for both the long-jump and the 100m ?" should be answered using the intersection operator. But the intersection operator as defined before, would produce an empty relation, because there is no tuple that appears in both relations ! Indeed, the DURING attribute in both tuples is unequal. This problem is dealt with by saying that when the intersection operator is applied to relations holding interval-typed attributes, it is sufficient that the interval-typed attribute values overlap (instead of having to be equal), and that the value returned for such an attribute is the interval intersection of the attribute values in the "intersected tuples".
In the example, this would yield the relation :
NAME |
DURING |
Bob Beamon |
1968-07-24 - 1969-05-05 |
And that's indeed the period during which, according to the relations of the example, Bob Beamon held the world record in both disciplines.
(Allthough it has nothing to do with the RA per se, this example also illustrates an issue that is crucial enough in temporal database design to mention it explicitly here : whether this result is indeed the correct answer to the stated problem, is somewhat doubtful. The stated problem was "Which athletes have been the world-record holder for both the long-jump and the 100m ?". Note carefully that there is no mention of a requirement to hold a world record in both disciplines _at the same moment in time_. Nor is there any mention of a relaxation that both world records must not have been held _at the same moment in time_. This issue has more to do with how we express our queries in natual language (and how apt or inapt natural language happens to be for that purpose), but especially when dealing with temporal information, it is an important potential source of misunderstandings between users and IT people.)
To revert to our actual subject, we continue with a question that might have arisen. If the intersection operator is "re-defined" to treat interval values by checking, not whether they are equal, but instead only whether they overlap, then what do I do if I really want the check to be for equality ? In our example, how do I get the intersection operator to produce an empty relation value, precisely on account of the fact that the intervals are not equal ? It will not usually be the case that this behaviour is the needed or desired one, but it cannot in general be excluded either.
So obviously, the conclusion seems to be that it must be possible for the user to specify which of the two conditions ("overlap" vs. "equal") must be used for some particular invocation of the operator. Furthermore, this must be possible for each individual interval-typed attribute that the relations have. Indeed, it might be the case that there is more than one interval-typed attribute in the relation, and that for some particular invocation of intersect, the "equality" behaviour is desired for one of those attributes, but the "overlaps" behaviour for the other.
It will now be intuitively obvious how the relational difference operator deals with interval types. Supposing the world-record relations are as in the intersection example, but the question is now "Which athletes have been (and during which period have they been) world-record holder for the 100m, but not for the long-jump ?".
The answer to this question is given by invoking the relational difference operator, that is defined to deal with interval values such as to return :
NAME |
DURING |
Bob Beamon |
1968-02-24 - 1968-07-24 |
So the relational difference operator deals with interval values by computing their interval difference, and including a tuple in the result only if that interval difference "still contains at least one point".
If in an invocation of the join operator, the set of join attributes contains an interval-typed attribute, then join deals with that attribute in exactly the same way as defined for intersection : it is not necessary for the interval values to be equal for a tuple to be included in the result of the join, instead it is sufficient that they overlap (i.e. have at least one value of their "underlying type" in common). The value returned for that attribute in the result of the join is the interval intersection of both interval values.
Division is a problematic operator to define a "temporal" counterpart for. The most important reason for this state of affairs being, I believe, that there is no single, unique sensible way to incorporate the "temporal" aspect into the formulation of the kind of queries that relational division was aimed to solve.
Recall from 4.4.9 that division was aimed to solve the kind of query like "get the list of customers who already have all possible types of insurance policy", or "employees who have followed all training programs they were eligible for", or some such.
There is no single and universally applicable answer to the question "Must those customers have had all possible types of policy at the same point in time, or is it sufficient that they have had at least one policy of all possible types at just any point in time -not necessarily the same point in time- ?". Nor is it obvious whether "all possible types of policy" refers to "all types of policy that were possible at one particular moment in time" or to "all types of policy that have been possible throughout all times", or still some other different possible interpretation.
For this reason (and possibly still others), a temporal counterpart to relational division is simply not defined. The relational division operator, if, when and where supported, therefore operates on relations holding interval-typed attributes in exactly the same way as it does on relations not holding such attributes. It is, however, very unlikely that the results produced by relational division when interval-typed attributes are in play, can make very much sense.
In section 4.4.10, the semijoin operator was defined to be equivalent to the projection of a join of the two arguments. The semijoin of relations holding interval-typed attributes is defined in the same way. But note that the result of the operator is influenced by how the "join" part of the operator deals with interval-typed values. Note that despite the fact that the "outer" operator of the equivalent expression is projection, which potentially gives rise to non-packed relations, this phenomenon cannot occur with semijoin.
Like semijoin, the definition of semidifference does not need to be altered for dealing with interval-typed attributes.
Defining a "temporal counterpart" for transitive closure is somewhat subject to the very same problem as defining a "temporal counterpart" for relational division.
So despite the fact that "parts explosions" and similar problems regularly show up in the IT world, and despite the fact that "attaching a temporal connotation" to this kind of information is quite conceivable, a "temporal counterpart" of the operator that is aimed to address this category of problems is currently not defined.
The group operator is affected only by the possible occurrence of interval-typed attributes if there are multiple group specifications. Consider the following hypothetical relation value :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Bob Beamon |
100m |
1968-02-24 - 1969-05-05 |
If this relation were grouped (without "special measures" to cater for the interval-typed attributes) per NAME, with the two grouping specificiation RVA_DISCIPLINE(DISCIPLINE) and RVA_DURING(DURING), then the relation-valued attribute RVA_DURING would contain a relation with two tuples, each holding a DURING value that overlaps. So the (relation-valued) RVA_DURING attribute in the output would not be in packed form.
Ungroup is also affected in the multi-element ungroup version. Consider following relation :
NAME |
RVA_DISCIPLINE |
RVA_DURING |
Bob Beamon |
rva_dis1 |
rva_dur1 |
Bob Beamon |
rva_dis2 |
rva_dur2 |
where the respective relation values of the RVA_ attributes are as follows :
rva_dis1
DISCIPLINE |
Longjump |
100m |
rva_dis2
DISCIPLINE |
Longjump |
rva_dur1
DURING |
1968-07-24 - 1998-06-18 |
rva_dur2
DURING |
1968-02-24 - 1969-05-05 |
Note that this relation cannot possibly be produced as the result of an invocation of the group operator. Nevertheless, under some interpretations, it is possible to define a result for a multi-element ungroup invocation on this relation.
Under such an interpretation, overlapping tuples could potentially be produced for the longjump discipline :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Longjump |
1968-07-24 - 1998-06-18 |
Bob Beamon |
Longjump |
1968-02-24 - 1969-05-05 |
Bob Beamon |
100m |
1968-02-24 - 1969-05-05 |
To repeat, such a multi-element version of ungroup is not commonly agreed upon.