|
What, why & who for ?
This document essentially contains an introduction to the relational algebra that is implemented in SIRA_PRISE. Depending on your background, you may already have some or very extensive knowledge about the subject, in which case you probably do not need to read this document.
If, however, your knowledge of and about the Relational Model of Data was primarily/solely obtained through the usage of SQL-based systems, and terms such as "relvar", "relation" and "tuple" sound completely unfamiliar to you, then carefully studying this document is probably a sine qua non for you to
understand and effectively use this new system you've got in your hands.
This introduction is subdivided in several parts :
- The first part, "Terminology", defines the components that lay the foundation for any relational database : type, value, tuple, relation, ... It is rather mathematical in nature and may therefore seem to be a bit alienating ("What the hell do IT professionals need
mathematics for ?"), but let me assure you that understanding this part really is relevant and really is needed for you to understand part two properly.
- Part two, "Relational Algebra", offers a detailed definition and discussion of all the operators of the algebra, which is the part of the relational model that every database user employs in every single query he/she writes and/or executes.
- Part two introduces the operators of the algebra, making abstraction of a very particular data type called the interval type. SIRA_PRISE offers support for interval types, and the third part, "The RA and interval types", introduces how the operators of the algebra behave
when faced with relations that hold interval-typed attributes.
Terminology
You may have wondered, at some time, why "Relational databases" (and "Relational dbms's") are called how they are. You may have assumed that it must have had something to do with this thing called "Entity-Relationship modeling" that you may already have practiced at some point in your career. Your
assumption was wrong. In fact, the first mention of the term "relational" in the context of data management even predates the invention of E/R modeling ... So where dóes the term come from ?
Well, the term comes from something that we have presumably all been taught about in our math classes, somewhere between the age of 10 and 20, that thing being the "relation". The Relational Model of Data (RM for short) is no more or less than a mechanism to represent data in general, and which has that
same mathematical concept of a relation as its key building brick. Before digging into the details of the algebra, it is first necessary to define a couple of terms that are frequently used when talking about the RM. That is where we will start this introduction of the relational algebra.
-
Set
Lots of people collect things. Stamps, coins, rare books, art reproductions or even original art masterpieces, just name it and someone exists who collects it. You probably have spent some time in your life too, trying to build up some collection. If you have, doubtless that there have been some items
that you have managed to collect more than once. It was problaby fun trading those "duplicates" with fellow collectors for items that you had not yet managed to collect. Well, a "collection" is a very fundamental concept in mathematics, one that is in fact needed to define what a "set" is.
In mathematics, a "collection" is, loosely speaking, any possible gathering in any possible quantity of any possible things. The gathering of (chair, cloud, oak tree, chair) is a collection. Note that the collection has two chairs.
Now, if a collection has the particular property that none of the items in the collection appear more than once, then we call this collection a "set". This property turns out to be a very useful and important one, and therefore sets, and set theory, are very important in mathematics. So important in fact
that relational database technology could and would not even have been invented without them. Important enough also, to be repeated in bold :
A set is a collection that does not contain duplicates. Something that does contain duplicates, is not a set.
-
Type/Value
-
Definition
So {A,1,oak tree,patent} is a set. {A,patent,A} is not. Now, in the context of data management problems, which is what we're interested in, sets such as those are of course not very useful. In the context of data management, sets are mostly useful only if the items they contain are in some way
"alike". In that context, we usually deal with sets that contain only numbers, only dates, or some such. Useful sets in data management are sets such as {Monday, Tuesday, Wednesday, Thursday, Friday} and {true, false}. Without attempting a formal and precise definition of the term, we can (and do) say that a
Type is such a "useful" set. And a Value is one of the items contained in a Type. Which brings us to :
A Type is a set of Values.
-
Additional remarks
In data management, we will often need to speak about, or refer to, such Types. For that reason, it is customary that all the Types we use are assigned a name. The type {Monday, Tuesday, Wednesday, Thursday, Friday}, e.g. could be assigned the name 'Weekdays'. The type {true, false}, e.g. could be
assigned the name 'Boolean'. Such that we can say :
A Type is a named set of Values.
You might wonder at the mutual dependency in the definition here. A type is a set of values, and a value is an item in a type. In the end, that doesn't really define much, does it ? Shouldn't we have some other definition of what a 'Value' is ? Well, for the purpose of this introduction, that isn't
really necessary. The reason for this is that the RM (and the algebra) can work with just any set of types. The RM does not "fall apart" if the set of integers is not defined as a type, nor does the algebra. (It would be difficult to build a useful database without that type, but that's another issue.) And
the most important consequence is that this observation is equally valid "the other way round" : the RM does not stop you from defining whatever type you like. The RM gives you no answer to the question when or why something is a value, and when or why not. As far as the RM is concerned, a value is whatever
you want it to be, and a type is whatever set you define it to be. It only depends on whether you have a useful purpose for the type or not. A set such as {Em, C, G, ...} might look pretty insane and useless to most readers, but a guitar player would certainly have some good uses for a type holding the guitar
chords as values ...
-
Cartesian product
-
Definition
Looking at our example types Weekday and Boolean, it's not difficult to see how one can "pick a value from the first type, then pick a value from the second one, then arrange those two picked values to form a pair.". One example of such a pair would be (Monday, false). There are 10 such pairs possible,
and those 10 pairs can be gathered into a set. Such a set turns out to be a very important concept in mathematics too, so important in fact that the concept of such a set is given a name : the "Cartesian Product".
The Cartesian product of two sets is the set of all possible pairs formed from the items in those two sets.
-
Remarks
The term "Cartesian product" is not often encountered in the world of RM. The reason for this will become clear in the next few sections. Nevertheless it is worthwhile to reiterate in a more informal manner : A Cartesian product is a set, and the things that we find in that set are "pairs of other
things", and those "other things" are precisely the things that are part of one of the sets that were used to form the Cartesian product.
-
Relation (Traditional mathematics version)
-
Definition
Traditional set theory then goes on to define the concept of a "Relation" :
A Relation is a subset of a Cartesian product of two sets.
-
Remarks
Examples of such relations are then : {(Monday, false),(Monday, true)}, {}, {(Tuesday, false),(Thursday, true)}, ...
Once again it is worthwhile to reiterate in a more informal manner : A Relation is a set, and the things that we find in that set are "pairs of other things", and those "other things" are precisely the things that are part of one of the sets that were used to form the Cartesian product of which the
Relation happens to be a subset.
-
Generalized product
-
Definition
Following this definition, a relation can be (and is indeed) defined as being a set of pairs. For the context of data management problems, this definition isn't so useful, however. Indeed, when recording data about our customers, sales, ..., we usually record a lot more things than just two. So for the
specific purpose of data management, the concept of a relation had to be "adapted" somewhat ("generalised" is the correct word, actually) to cater for that problem. It is not difficult to see how this can be done : instead of "picking values from two types and arrange those", one can also "pick one value from
each of any number of types, and arrange those". The set that is formed by gathering all possible such combinations, is no longer called the "Cartesian" product, but rather the "generalized" product :
The generalized product of n types is the set of all possible combinations of values formed from those n types
-
Additional Remarks
The term "generalized product" is not often encountered in the RM world. It only serves as a step-up to the definition of what a Relation is in the RM world.
Once again it is worthwhile to reiterate in a more informal manner : A Generalized Product is a set, and the things that we find in that set are "combinations of other things", and those "other things" are, precisely, the Values that are part of one of the Types that were used to form the
Generalized Product.
An important remark is to be made here. The generalized product is defined to be a set, specifically. And when defining a Type, it was said that a type is a set, and that the items that a type contains are called 'Values'. So the question could arise, if a generalized product is a set, and all it takes
to be a type is to be a set of values, could we regard a generalized product as a type, and the items that the generalized product contains as values ? The answer to whether we can is obviously 'yes', but more important, is there any benefit in doing so ? This issue will be addressed when discussing a.o. the
GROUP operator of the algebra, but it is worthwhile to remember already now :
Each generalized product can be considered as a Type in its own right
-
Relation (Relational Model Version)
-
Definition
The concept of a Relation can now be generalized as follows (and it is this definition that is the cornerstone of the RM - and thus of the algebra too) :
A Relation is a subset of the generalized product of n types
-
Additional Remarks
Informally, this translates to "A Relation is a set, and the things that we find in that set are "combinations of other things", and those "other things" are, precisely, the Values that are part of one of the Types that were used to form the Generalized Product that the Relation happens to be a
subset of.".
It is also important to remember that, in the context of the Relational Model of Data, the word "Relation" always means the kind of set defined here, and not to the more "traditional" definition that we were/are usually taught in our math classes.
-
Tuple
-
Definition
The items found in a Relation (RM version) are then "pairs" having three, or twenty, or in fact any number of values. The name "pair" of course then becomes somewhat inappropriate to label such things, and of course, it is also desirable to just have a term, preferably other than "combination of n
Values", to use whenever we want to refer to these things that form a Relation. That term is "Tuple".
A Tuple is one item of/in the generalized product of n types,
A Tuple is a combination of values drawn from n types
-
Additional Remarks
First and foremost, we have already established that a generalized product can be considered as a type in its own right. Implying that the things that such a generalized product contains, are themselves 'values'. Given that we have just baptised these things as 'Tuples', we can conclude :
A tuple is, itself too, a value.
Given that a generalized product contains only tuples as its members, and given that a Relation was defined to be a subset of a generalized product, the following is obviously true :
A Relation is a set of tuples (formed from n types)
-
Application to data management
By now, the question may rise, "How is all of this useful to the field of databases ?". The answer to this question is best illustrated by revisiting the previously given WeekDay/Boolean example. The "Generalized Product" as previously defined, when applied to the Weekday/Boolean example can be
represented as follows :
Monday |
true |
Monday |
false |
Tuesday |
true |
Tuesday |
false |
Wednesday |
true |
Wednesday |
false |
Thursday |
true |
Thursday |
false |
Friday |
true |
Friday |
false |
One tuple of this "Generalized Product" can be represented as follows :
And one Relation that is a subset of this "Generalized Product" can be represented as follows :
Monday |
true |
Tuesday |
true |
Wednesday |
false |
Thursday |
true |
Friday |
true |
These representations may start ringing a bell. They all look like a table, and that's not coincidence. When doing your database work, you probably refer to these things as "row" and "table", which are the "synonyms" that were introduced by SQL (the language that you probably use -in some dialect- to
manage your databases) for the terms "tuple" and "relation", respectively.
For the purposes of the present document (which is to clarify the algebra), it is not that terribly incorrect to think of a tuple as "being a row" and of a relation as "being a table". However, the algebra is defined in terms of tuples and relations, not in terms of rows and tables (they are not truly the
same thing), so this document will consistently use the terms "tuple" and "relation".
Nevertheless, for a proper understanding of other textbooks that attach more importance to using formally precise terminology, it is indeed important for your attention to be drawn to
Some crucial issues
-
The SQL word "table".
In SQL, the "table" is undoubtedly a rather important concept. But what exactly does the SQL term "table" refer to ? You have just been presented three different things that are all three also tables, and only the third one of them would really correspond to the meaning that SQL attaches to the
word. The second one would rather correspond the the SQL term "row", and the first one wouldn't really correspond to any SQL term at all ... This only goes to show how SQL's choice of terminology leads to a regrettable blurring between concepts that are quite distinct, as far as the theoretical foundations of
the RM are concerned. The mathematical objects that have been presented here are really quite distinct, and in fact the only thing they have in common is that the same presentational technique (a table) might be used to visualise
them. But that does not make the visualised mathematical concepts themselves, identical.
-
A relation is a set ...
There is a very important consequence that derives from the way that the concept of a Relation is defined. A relation is a set, and a set is defined to be a collection that has no duplicates. See the consequence ? No tuple can appear twice in a relation. Or else it is no longer a relation.
SQL deviates from this principle, and you may wonder whether that is really so problematic. It is beyond the scope of this document to discuss the issue profoundly, suffice it to say that SIRA_PRISE faithfully follows the following rule (which stems from the way the RM is defined) :
No tuple can ever appear twice in a relation
-
... of tuples, formed from n Types
It might have gone unnoticed, but another crucial thing is the 'n' in the definition. In the example of the Weekday/Booleans, does the generalized product contain any tuple of only one value ? Or can it possibly contain any tuple in which, e.g., the day is _not_ a weekday ? Obviously, the answer to
both these questions is "no". And given that a relation is a subset of this generalized product, the same goes for any relation over those types.
SQL deviates from this principle by offering the possibility to include nulls in its rows. On this issue too, entire libraries have already been filled with the discussion of whether or not this is really a problem. Once again, it is beyond the scope of this document to add to that discussion, so once
again it is only observed here (and important to remember) that SIRA_PRISE faithfully applies the rules of the model :
In a Relation, all tuples contain an equal number of values, and that number of values is equal to the number of Types that the generalized product was defined with No tuple can contain any value that is not a value of the corresponding Type
-
Attributes
When we mentioned the definition of Cartesian product and the pairs of values that it consists of, we incorrectly spoke of the Cartesian product as being "the set of all such possible pairs". You may have noticed the error : the ordering of the values in the pair was overlooked. It was tacitly assumed
that the values picked from the "first" type would have to occupy the first position in the pair, and the values picked from the "second" type would have to occupy the second position in the pair. Otherwise it would also have been possible, in the given example, to construct pairs such as (false, Friday).
So in the way that mathematics usually deals with relations and such, the order of the values is important and relevant. Implying that if we are given such a pair, and we need to "extract" a value from such a pair, then we need to know whether it is the first or the second value we're interested in.
Given that we're interested in data management, and that we will often be using relations and tuples with a significantly larger number of values, is that an interesting way to work ? Let's imagine we have some generalized product of 20 types (think of this as the declaration of a 20-column table), and
all programs that operate on that table must refer to the values as "the sixteenth value", or "the thirteenth value" (for lack of another means to refer to those values precisely and unambiguously). Apart from the fact that it is mentally very tiring to write programs in that mode (causing way more
programmer errors), consider what would need to be done if at one point in time we no longer needed the "third" value, and wanted to remove the corresponding type from the generalized product ("drop the column from the table"). All references in all programs to values beyond the third would have to be
decreased by one ...
That's definitely not a good idea. Which is why the RM explicitly states that all values in a tuple must be associated with a name, and that that name should be the only way available to address that value.
You probably observed that the tables the previous example lacked something essential, and indeed you were right. What was lacking were precisely the names which are needed to establish "addressibility" of the values in the tuples. These names are called the attribute names. The table from the
examples should actually have looked more like the following :
DAY |
GOTOSCHOOL |
Monday |
true |
Tuesday |
true |
Wednesday |
false |
Thursday |
true |
Friday |
true |
Now, what does this prescription imply for the tuples that the algebra works with ? For the tuples and/or relations themselves, in fact it doesn't make so much of a difference. But it does for the way in which we represent those tuples and/or relations. When we have the attribute names, we can represent
a tuple of this relation in a non-tabular way too : {DAY:Monday, GOTOSCHOOL:true}. And it wouldn't make a difference if we wrote {GOTOSCHOOL:true, DAY:Monday} instead.
Likewise, the ordering of neither the rows nor the columns in the tabular representation of a relation is material. The following relation is exactly the same as the previous one :
GOTOSCHOOL |
DAY |
true |
Thursday |
true |
Tuesday |
false |
Wednesday |
true |
Monday |
true |
Friday |
In general, it can therefore be stated that :
The RM has no concept of ordering or ordinal position of attributes within tuples, no concept of ordering or ordinal position of attributes within relations, and no concept of ordering or ordinal position of tuples within relations.
Observe that the purpose of the attribute names implies that an attribute name cannot appear twice in a tuple. If an attribute name were allowed to appear twice in a tuple, then we would no longer be able to use that attribute name to uniquely identify the value of that attribute in a tuple. We can
therefore also say :
A tuple is a set of named attribute values
-
Some more terminology and properties
Having put all this in place, we can make a number of important observations and define a few more useful terms wrt the RM.
We have already observed that in a relation, each tuple has an equal number of attribute values (and that number is equal to the number of types that were used to define the generalized product). That number is typically referred to as the degree of the relation and its
tuples. SQL users may think of this as the "number of columns in a table".
Given that a relation is defined to be a set, it follows naturally that a number of concepts from set theory in mathematics apply equally well to relations. One such concept is the number of tuples that the relation has. In set theory, this number is called the cardinality of the set. The RM has retained the
term and speaks of the number of tuples in a relation as that relation's cardinality. (Since we have established that a tuple is also a set, it is theoretically possible to speak of the cardinality of a tuple, too. However, since this concept has already been baptised
"degree", that is usually not done.)
It will also be clear that in a relation, all tuples will not only hold an equal number of attribute values, but the set of attribute names used to refer to those values, will be the same no matter which tuple one looks at. It will furthermore be clear (even if it has not been stated explicitly) that the
values in any two tuples for a certain attribute name, will be values from the same type, specifically. There will be no two tuples where one has a value from GregorianDate in attribute X, and the other has a value from some other type in attribute X.
It is thus possible to list all the attribute names of a relation, and next to each name put the name of the type from which the tuples draw their values for that attribute. In the example given previously, this list would look like : GOTOSCHOOL:Boolean, DAY:Weekday. This list is typically referred to as
the heading of the relation and of the tuples.
-
What about the keys of a relation ?
Keys are, obviously, an important concept in relational databases. But as far as the purpose of this page is concerned, which is to explain the algebra, and to introduce the terminolgy that is needed to do so, the concept of keys bears no relevance.
In fact, it is, strictly speaking, even plain wrong to speak of "the key of a relation". In the RM, a key is something that applies to a relation variable, not to the values (the relations) such a variable possibly contains. A key is, in a sense, nothing more than a user-friendly facility that allows to
define certain specific classes of database constraint. This chapter does not deal with constraints, and so there need be no mention here of the concept of 'keys'.
Relational Algebra
-
The term 'algebra'
Time to start talking about the algebra. You may wonder at the idea of an 'algebra' with relations ... Wasn't 'algebra' about computing ? And isn't computing performed on numbers exclusively ? The answer is, obviously, 'no'. George Boole defined an algebra that allowed us to do computations on the logical
truth values 'true' and 'false'. An algebra exists that allows us to do computations on matrices. The operations of union, intersection and difference as they are applied in set theory, form an algebra that is used to do computations on sets. And so on and so on ...
Then what does it take to 'be an algebra' ? How is it that the Relational Algebra (RA for short) is indeed an algebra ? Well, an algebra is essentially nothing but a set of computation operators. Ordinary arithmetic on numbers, e.g., is an algebra consisting of the operators addition, subtraction,
multiplication, division, exponentiation. Set algebra includes the operators union, intersection and difference. Likewise, the RA is essentially just a set of operators that "compute" with relations as they have been defined in the introductory chapters.
And just as the addition and multiplication operators operate (in their most basic form) on two numbers to yield another number, so will an operator of the RA (commonly referred to as a 'relational operator') in its most basic form operate on two relations to yield another one.
-
Do I really need an algebra to manipulate the data in my databases ?
You might think that the answer to this question is 'no'. After all, you've been using SQL for years already perhaps, and no one's ever mentioned the word 'algebra' in connection to SQL, right ? Well, it may be right that you've never heard anyone mention the word 'algebra' in connection with SQL.
Nevertheless, the connection does indeed exist. The SQL language has, since day 1 of its existence, been intended as an implementation of the RA. Every query that you write is nothing more or less than an expression, a formula, of the RA. So if you've been using SQL, you've probably already written dozens of
expressions of the algebra. You just weren't aware.
-
A short note on 'value vs. variable' and operator nesting
We mentioned earlier that, as far as the purpose of the present document is concerned, it is not terribly wrong to think of a relation as 'being a table'. At this point, however, we should clarify an important distinction that SQL mostly fails to make, and hence that SQL users mostly fail to recognize.
When programming, you probably define variables that you subsequently use to store values in. You do so by "declaring" those variables with a certain name (e.g. int c;), after which you can assign a value to (i.e. store a value in) that same variable (e.g. c=1;). You usually also declare the type of values that
the variable is allowed to store (integers in the example).
In much the same sense, SQL's tables can also be considered a "variable" of the programming system. Not one that is declared locally for some small program procedure, but rather one that is declared globally for every program that accesses the database. (The C programming language, and probably lots of
others too, has the concept of 'static external' variables, which is somewhat similar.) As any other variable, it can be considered as "storing", or "containing" a value. That value being, precisely, the relation that consists of all the tuples that were "added" (and not yet deleted back again). A variable that
is declared to contain integer values specifically, is usually referred to as an "integer variable". By analogy, the RM uses the term "relation variable" for a variable that is declared to contain relation values specifically. RM literature often abbreviates that term to "relvar".
SQL blurs this distinction rather seriously by using the word 'table' for both of these concepts (the word stands both for the variable and for the -relation- value it contains).
When discussing the operators of the RA, it is important to remember that they always deal with relation _values_ specifically, without any consideration as to where and/or how that relation value originated. In particular, an operator of the RA does not prescribe that its input can only be a "table" (in
the 'variable' sense of the word). Quite the contrary : it is perfectly possible for the input of a relational operator to be some other expression of the RA. The input to a projection can be the result of a restrict, and the input to that restrict can be the result of a join, etc. etc. In other words :
expressions of the RA can have any arbitrary level of nesting, just so long as those expressions themselves evaluate to relation values, i.e. yield a relation value.
-
The syntax employed in the examples.
The examples in the discussion of the relational operators below, will adhere closely to the syntax used by SIRA_PRISE. For purposes of readability however, certain expressions in the examples will not be given using the syntactic notation used by SIRA_PRISE. This is mainly for purposes of
readability, and for maintaining a higher degree of "visual familiarity" towards database users who are used to SQL expressions.
-
The operators of the algebra
The operators that we will be introducing here, are the following : restriction, projection, extension, renaming, the transform shorthand, intersection, union,
difference, join, division, semijoin, semidifference, symmetric difference, left join, transitive closure, generalized transitive
closure, grouping, ungrouping, aggregation and summarization.
Note that the operators' "definitions" as we will present them here are not always "formal" definitions. The only goal we try to achieve is to explain the basic operation that the operator stands for, which can often be achieved equally well by explaining the basic algorithm it performs. The formal definitions
for the operators can be found in the javadoc for the operators.
-
Restriction
Often we will want to inspect a relation value and "retain" only those tuples of the inspected relation that satisfy certain conditions. The restriction operator is the operator that can perform that task for us. The restrict operator takes a relation as input, plus a boolean expression, and yields a
relation consisting of all and exactly those tuples of the input relation for which the boolean expression evaluates to 'true'.
The following example lists all the attributes whose defined maximal length exceeds 256 :
Restriction :
ARG |
RESTRICTCONDITION |
RESTRICT |
MAXIMUMLENGTH |
ATTRIBUTENAME |
124 |
CLIENTID |
2147483647 |
OPERANDSIGNATURE |
1022 |
CONSTRAINTLABEL |
786428 |
SP_EXPRESSION |
252 |
TYPESIZEHINT |
126 |
SPECIESNAME |
786428 |
RELVARPREDICATE |
126 |
NLSPECIESNAME |
786428 |
CONSTRAINTMESSAGETEXT |
1048572 |
CERTIFICATE |
|
gt(maximumlength,256) |
MAXIMUMLENGTH |
ATTRIBUTENAME |
2147483647 |
OPERANDSIGNATURE |
1022 |
CONSTRAINTLABEL |
786428 |
SP_EXPRESSION |
786428 |
RELVARPREDICATE |
786428 |
CONSTRAINTMESSAGETEXT |
1048572 |
CERTIFICATE |
|
Observe that just any boolean expression is allowed as the restrict condition, and that there is no obligation for the expression defining this restrict condition to reference any attribute of the relation that is restricted. So the following expressions are all valid :
RESTRICT(R , false)
RESTRICT(R , true)
RESTRICT(R , 1 = 1)
RESTRICT(R , 1 = 2)
A degenerate restrict condition :
ARG |
RESTRICTCONDITION |
RESTRICT |
MAXIMUMLENGTH |
ATTRIBUTENAME |
2147483647 |
OPERANDSIGNATURE |
1022 |
CONSTRAINTLABEL |
786428 |
SP_EXPRESSION |
786428 |
RELVARPREDICATE |
786428 |
CONSTRAINTMESSAGETEXT |
1048572 |
CERTIFICATE |
|
boolean(false) |
MAXIMUMLENGTH |
ATTRIBUTENAME |
|
Another thing that may be less obvious : the boolean expression can itself be of arbitrary complexity, and can also contain references to other relation variables ! It is, e.g. perfectly possible to check whether an integer attribute in some tuple contains a value that is equal to the cardinality of some
other relvar, as in 'X where score = cardinality(Y)', where score is an attribute of X, Y is some other database relvar, and cardinality is an operator that returns the number of tuples in (the relation that) its argument (evaluates to).
-
Projection
In information processing, it is very unlikely that all programs will always need to inspect the values of all attributes in a relation. Often, databases also contain 'comment' fields and 'label' fields which only serve a descriptive purpose, and which often aren't needed by the programs
that do the "core business" of the organisation.
So it might be interesting to be able to build, starting from a given relation, another relation that has only the attributes we really need. This is precisely what the projection operator can do for us. The projection operator takes a relation as one argument, and a set of attributenames as the
other (where obviously, that set can only contain names of attributes that effectively appear in the relation). It then produces a relation that holds one tuple for each combination of values that appears in the input relation for the given list of attributes (that tuple being precisely that combination of
values).
Important : take a close look at the way the definition for the result of the projection operator is formulated. The result of a projection contains one tuple "for each combination of values found for the projection attributes in the input relation". Nothing is said about how many times this "combination
of values" appears in the input, or how it would affect the number of times this "combination of values" appears in the result. In fact, it would be a violation of the relational model if any such thing were said. Any such "combination of values" can only appear once in the result anyway, otherwise the result
wouldn't be a relation. Remember that the projection operator is supposed to return a relation, that a relation was defined to be a set of tuples, and that therefore, the result of a projection cannot possibly contain the same tuple >1 times. Otherwise, the result of the projection would not be a relation
anymore (at least not guaranteeably in all cases). It is important to remember that the projection operator of the algebra always "eliminates duplicates" in its result.
Projection will eliminate duplicates :
ARG |
PROJECTSPEC |
PROJECT |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
(PAGESIZE) |
PAGESIZE |
16384 |
32768 |
49152 |
|
Projection on no attributes at all is supported :
ARG |
PROJECTSPEC |
PROJECT |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
() |
|
This example illustrates that it is problematic to visualise a relation with one tuple and no attributes as a table with one row and no columns (some browsers may allow you to observe that the line around this table is a bit thicker than the lines around the other tables), so things might be a little
clearer if we just write an equality test ( EQ(TABLE_DEE, ... ) ) of the result with TABLE_DEE (the nilary relation value that includes the zero-tuple in its body) :
Query result :
ARG |
PROJECTSPEC |
EQUAL_TO_TABLE_DEE |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
() |
True |
-
Extension
Sometimes, it is interesting to be able to "derive" values for additional attributes in a relation, based on the attributes that are already there. Application programs may benefit if the dbms can handle these kinds of "derivation" for them, rather than having to do it themselves. This is precisely what
the extension operator can do. The extension operator takes a relation as one argument, and a set of extension specifications as its second argument. An extension specification consists of both an attribute name and an expression that will determine the values of the additional attribute for each tuple of the
relation argument : for each tuple in the relation argument, every expression in the set of extension specifications is evaluated. For each such expression, the value obtained is "appended" to the tuple as the value of the attribute the expression corresponds to.
The following example uses the extend operator to compute the length of the name of the various dbms files :
Extension :
ARG |
EXTENDSPEC |
EXTEND |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
namelength(length(filename)) |
NAMELENGTH |
FILENAME |
PAGESIZE |
15 |
BW_SPECIES.SPDB |
32768 |
23 |
BW_POSSIBLESPECIES.SPDB |
16384 |
20 |
DATABASECATALOG.SPDB |
32768 |
20 |
BW_SPOTSXSPNAME.SPDB |
49152 |
13 |
BW_SPOTS.SPDB |
16384 |
|
A few rather obvious rules apply to the extension specifications : it is obviously illegal to specify an attribute name that already exists in the input relation (otherwise the result would contain two attributes of the same name, which is against the rules that the RA imposes on its relations), and it is
equally illegal to specify an attribute name twice in the set of extension specifications, for the same reason.
Like with restrict, just any expression can be specified, just as long as it does not contain references that the system is unable to resolve. In particular, literals are also valid expressions in an extend specification, as are expressions that somehow refer to other database relvars, including the
SIRA_PRISE context relvars :
Extension with an expression involving a [context] relvar reference :
ARG |
EXTENDSPEC |
EXTEND |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
currentdate(currentdate) |
CURRENTDATE |
FILENAME |
PAGESIZE |
|
BW_SPOTSXSPNAME.SPDB |
49152 |
|
DATABASECATALOG.SPDB |
32768 |
|
BW_SPECIES.SPDB |
32768 |
|
BW_SPOTS.SPDB |
16384 |
|
BW_POSSIBLESPECIES.SPDB |
16384 |
|
Note that care must be taken, however, to avoid the emerging of runtime exceptions, such as division by zero. If an extend expression involves an operator invocation that potentially gives rise to such exceptions, it is the programmers responsibility to take all necessary measures to prevent/avoid them.
-
Renaming
Attribute renaming is an operator that is very similar to projection and extension, and can in fact even be considered as a (very particular) combination of both. The prime reason why it exists as a separate operator nevertheless, is that it is rather frequently needed in combination with the join
operator (which will become clear when we discuss that operator).
Its operation should be clear simply by its name : it does nothing than produce a relation value that has exactly the same set of tuples (at least as far as the attribute values they hold are concerned), but where some of those attribute values now appear under another attribute name.
Formally, that is realised by giving as the operator's second argument a set of rename specifications. Each rename specification consists of two attribute names, the first one of which must refer to an existing attribute in the input relation, and the second one of which defines the name under which the
concerned attribute values will appear in the result of the operator's invocation.
SIRA_PRISE's RENAME is "parallel", meaning that it is not required that each individual new name be a name that does not appear in the argument. It is sufficient that there will be no duplicate names after ALL renaming has been applied. This facilitates "swapping" of attribute names :
A "parallel" RENAME :
ARG |
RENAMESPEC |
RENAME |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
(FILENAME,PAGESIZE,PAGESIZE,FILENAME) |
FILENAME |
PAGESIZE |
32768 |
DATABASECATALOG.SPDB |
16384 |
BW_SPOTS.SPDB |
49152 |
BW_SPOTSXSPNAME.SPDB |
16384 |
BW_POSSIBLESPECIES.SPDB |
32768 |
BW_SPECIES.SPDB |
|
-
Often, combinations of PROJECT/EXTEND/RENAME are needed in order to produce the desired result from a given relation. In many of these cases, the needed combination of invocations of these three can be combined into a single invocation of TRANSFORM. The features of the three operators combined into
TRANFORM are as follows :
- Like EXTEND, TRANFORM has the ability to compute values for extra new attributes in the relations produced.
- Like PROJECT, TRANFORM has the ability to discard some attributes from the input relation, as well as to simply retain some others.
- Like RENAME, TRANFORM has the ability to "copy over" values from input to output, but to make them appear under a new attribute name.
The following examples illustrate these abilities :
Extend, Project and Rename combined into one :
ARG |
TRANSFORMSPEC |
TRANSFORM |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
(pagesize,name(lowercase(the_string(filename)))) |
NAME |
PAGESIZE |
bw_species.spdb |
32768 |
databasecatalog.spdb |
32768 |
bw_possiblespecies.spdb |
16384 |
bw_spotsxspname.spdb |
49152 |
bw_spots.spdb |
16384 |
|
Note how :
- the PAGESIZE attribute is "copied over" since it is mentioned in the transform spec without being accompanied by a computational formula
- the FILENAME attribute is "projected away" by transform since it is not mentioned in the transform spec as an output attribute
- the NAME attribute is "added" to the result since it is mentioned in the transform spec, accompanied by a computational formula to produce a lowercase rendering of the FILENAME attribute of the input
-
Union
The union operator is intuitively easy to understand, because of its similarity to the union operator of classical set algebra. The union operator of the RA takes two relations as argument, and yields a relation that contains all and exactly those tuples that appear in either of the two input relations.
Observe that of course once again, the relational requirement of "no duplicate tuples" holds. Observe too that the heading of both arguments must be identical, for the result to be a valid relation. This means that the relation arguments of the union operator must have the same number of attributes and the same
set of attribute names, and for each pair of equally-named attributes, the type of that attribute must also be the same. Relations for which this condition holds are generally said to be "union-compatible".
The following example shows that UNION always eliminates duplicates :
A UNION of identical relations :
ARG1 |
ARG2 |
RESULT |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
-
Intersection
The relational intersection operator closely resembles its counterpart of classical set algebra. It takes two relations as input value, and yields a relation containing all and exactly those tuples that appear in both arguments. Observe that this implies that both arguments must be union-compatible.
Relational intersection :
ARG1 |
ARG2 |
INTERSECT |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_POSSIBLESPECIES.SPDB |
16384 |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_POSSIBLESPECIES.SPDB |
16384 |
|
-
Difference
As is the case with union and intersection, so does the relational difference operator resemble very closely its counterpart in classical set algebra. As can be expected, the relational difference operator takes two relation values as argument, and yields a relation containing all and exactly those
tuples which do appear in the first argument (the minuend), but do not appear in the second (the subtrahend).
The following example employs the relational difference to find all the dbms files that have an 'O' in their name, but not a 'C' :
Relational difference :
ARG1 |
ARG2 |
MINUS |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_POSSIBLESPECIES.SPDB |
16384 |
|
FILENAME |
PAGESIZE |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_SPOTS.SPDB |
16384 |
|
-
The join operator implemented by SIRA_PRISE is the natural join operator. This operator is defined as follows :
If one relation has heading {a, b, c, ..., m, n, ...}, and another relation has heading {m, n, ..., v, w, x, ...}, then the natural join of these relations is a relation with heading {a, b, c, ..., m, n, ..., v, w, x, ...}, i.e. all the attributes that appear in the heading of either relation also appear
in the heading of the join (but in line with the prescriptions of the RA, they do so only once).
A tuple {a1, b1, c1, ... m1, n1, ..., v1, w1, x1, ...} will appear in the resulting relation if (and only if) : (A) a tuple {a1, b1, c1, ... m1, n1} appears in the one argument of the join, and (B) a tuple {m1, n1, ..., v1, w1, x1, ...} appears in the other argument of the join.
Observe that this definition holds equally well for the case where the set of attributes {a, b, c, ...} is actually empty (argument 1 has all attributes in common with argument2), the case where the set of attributes {m, n, ...} is actually empty (arguments have no attribute in common), and the case where
the set of attributes {v, w, x, ...} is actually empty (argument 2 has all attributes in common with argument 1).
The set of attributes {m, n, ...} is commonly referred to as the set of "join attributes". It follows from this definition that the set of join attributes is determined by (and only by) the heading of its arguments. Furthermore, the condition for "joining" two tuples is that all values in those two tuples
appearing for a join attribute, must be equal. Natural join does not allow to specify that the "join condition" is satisfied if join attributes are unequal.
If you're used to using SQL, those two properties may seem like a rigid restriction, but they can be easily overcome, and both by using the RENAME operator mentioned earlier. Renaming a join attribute in only one argument has the effect of "moving the attribute out of the set of join attributes", which
already solves the first problem.
As for the second problem, assuming that we wanted a join condition to be something similar to 'A.m1 > B.m1', we can simply rename the m1 attribute in one of both join arguments to move it "outside the set of join attributes", and then apply the greater-than condition in a restrict applied to the
result of the join. This need not imply that the join is first computed completely before the restrict condition is applied.
The following examples illustrate the characteristics of the natural join operator by showing the result when the two ARG relations are joined to produce the result listed under JOIN :
Natural JOIN :
ARG1 |
ARG2 |
JOIN |
STORAGESPACEID |
PAGECOUNT |
FILENAME |
1 |
2250 |
BW_SPOTS.SPDB |
2 |
23 |
DATABASECATALOG.SPDB |
17 |
30 |
DATABASECATALOG.SPDB |
51 |
412 |
DATABASECATALOG.SPDB |
3 |
650 |
BW_SPOTSXSPNAME.SPDB |
117 |
8 |
DATABASECATALOG.SPDB |
3 |
605 |
DATABASECATALOG.SPDB |
10 |
191 |
DATABASECATALOG.SPDB |
1 |
50 |
BW_POSSIBLESPECIES.SPDB |
52 |
43 |
DATABASECATALOG.SPDB |
145 |
57 |
DATABASECATALOG.SPDB |
21 |
610 |
DATABASECATALOG.SPDB |
65 |
150 |
DATABASECATALOG.SPDB |
1 |
2850 |
BW_SPECIES.SPDB |
15 |
24 |
DATABASECATALOG.SPDB |
54 |
14 |
DATABASECATALOG.SPDB |
22 |
14 |
DATABASECATALOG.SPDB |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
STORAGESPACEID |
PAGESIZE |
PAGECOUNT |
FILENAME |
1 |
32768 |
2850 |
BW_SPECIES.SPDB |
54 |
32768 |
14 |
DATABASECATALOG.SPDB |
65 |
32768 |
150 |
DATABASECATALOG.SPDB |
22 |
32768 |
14 |
DATABASECATALOG.SPDB |
15 |
32768 |
24 |
DATABASECATALOG.SPDB |
10 |
32768 |
191 |
DATABASECATALOG.SPDB |
21 |
32768 |
610 |
DATABASECATALOG.SPDB |
117 |
32768 |
8 |
DATABASECATALOG.SPDB |
51 |
32768 |
412 |
DATABASECATALOG.SPDB |
1 |
16384 |
50 |
BW_POSSIBLESPECIES.SPDB |
3 |
49152 |
650 |
BW_SPOTSXSPNAME.SPDB |
2 |
32768 |
23 |
DATABASECATALOG.SPDB |
3 |
32768 |
605 |
DATABASECATALOG.SPDB |
52 |
32768 |
43 |
DATABASECATALOG.SPDB |
145 |
32768 |
57 |
DATABASECATALOG.SPDB |
1 |
16384 |
2250 |
BW_SPOTS.SPDB |
17 |
32768 |
30 |
DATABASECATALOG.SPDB |
|
Joins can also be performed with relation-typed attributes among the join attributes. The following is such an example, producing a tuple in the result because the OPERANDSIGNATURE attribute values are equal between the arguments :
A natural join involving relation-valued attributes as join attributes :
ARG1 |
ARG2 |
RESULT |
TYPENAME |
OINM |
OPERANDSIGNATURE |
BOOLEAN |
MATCHES |
TYPENAME |
ORDINAL |
STRING |
1 |
STRING |
2 |
|
|
OINM2 |
TYPENAME2 |
OPERANDSIGNATURE |
CONCAT |
STRING |
TYPENAME |
ORDINAL |
STRING |
1 |
STRING |
2 |
|
|
TYPENAME |
OINM2 |
OINM |
TYPENAME2 |
OPERANDSIGNATURE |
BOOLEAN |
CONCAT |
MATCHES |
STRING |
ORDINAL |
TYPENAME |
1 |
STRING |
2 |
STRING |
|
|
-
Division
Occasionally, businesses are interested in things like "the list of customers who already have all possible types of insurance policy", or "employees who have followed all training programs they were eligible for", or some such. Ever wondered how you would write such a query in SQL, without using any
program code ? To get to the solution, a little rephrasing of the stated problem (e.g. the customers with all types of insurance policy) will be helpful : "the list of customers for which no insurance policy type can be found such that that customer does not have a policy of that particular type". The SQL for
this might look something like :
SELECT ... FROM CUSTOMERS WHERE NOT EXISTS (SELECT ... FROM POLICYTYPE WHERE NOT EXISTS (SELECT ... FROM POLICY WHERE POLICY.CUSTNO = CUSTOMERS.CUSTNO AND POLICY.TYPE = POLICYTYPE.TYPE))
Queries such as these constitute an invocation of the relational division operator. This example sufficiently reveals that the relational division is a rather complex operator. Furthermore, it is an operator that isn't needed all that often, and is therefore usually not easily understood. The complexity
of its semantics are also revealed by the fact that it needs three relation arguments to work on (inspect the SQL version closely to verify this).
In order to explain the operator as intuitively as possible, we name the three relations that are an argument to the division the subject (customers in the SQL example), the intersection (policies in the SQL example) and the target (policy types in the SQL example).
Essentially, what the operator does, is to process the subject relation on a tuple-per-tuple basis. For each such tuple (e.g. each customer), it looks up (in the intersection relation) a "complete" set of information relating to that subject tuple (e.g. the "complete" set of all policy types that this
customer has subscribed), and then checks if that "complete" set of information is a superset of the target (e.g. all possible policy types), i.e. whether all entries in the target can also be found in the "complete set of information" from the intersection relation. If it is, then the subject tuple is included
in the result of the division, otherwise it is not.
Given that the operator needs to deal with "sets of information", we will later see that there are better ways to deal with this kind of problem.
You might wonder why we want the check against the target to be a superset test, and not an equality test. After all, surely there can't be policies of a policy type that doesn't exist, can there ? Possibly true, but the target might also be "all policy types that already have been commercialised by our
company for more than 20 years". In which case there could definitely be a policy type subscribed by a customer that does not appear in the target relation of the division.
There actually are quite a lot of subtleties such as this one related to the definition of the division operator. Others are to do with restrictions that are either or not imposed on the heading of the three arguments.
Which is yet another reason why the operator is rarely talked of. In fact, the operator is mentioned here more for completeness' sake and to draw the reader's attention to the class of problems it was aimed to solve.
-
Semijoin
In our discussion of join, we have seen that the heading of the result is the union of all attributes that appear in either one of the join arguments. Quite often though, this is not exactly what we need. Quite often we are only interested in knowing for which tuples of one relation, a "corresponding"
tuple can be found in another relation, without needing to know the actual contents of that corresponding tuple.
This is why the semijoin operator is usually defined as a separate operator, and indeed supported by SIRA_PRISE. The result of the semijoin of two arguments is the same as if the join of the two arguments were computed, and that result projected back again to just the attributes of the first argument of
the join.
The following examples illustrates semijoin :
Semijoin :
ARG1 |
ARG2 |
SEMIJOIN |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
CATEGORY |
PAGESIZE |
big |
32768 |
|
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
|
-
Semidifference
Just as we often find ourselves confronted with the need to know for which tuples of a relation there exists a corresponding tuple in another relation, so do we find ourselves often confronted with the need to know for which tuples of a relation there does not exist a corresponding tuple in some other
relation. This is precisely what the semidifference operator does.
The result of the semidifference operator is equal to the (relational) difference of the first argument and the semijoin of the two arguments :
Semidifference :
ARG1 |
ARG2 |
SEMIMINUS |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
CATEGORY |
PAGESIZE |
big |
32768 |
|
FILENAME |
PAGESIZE |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
-
Left join
Natural join has the property that it only includes tuples in the result if in the "second" relation, at least one tuple is found whose attribute values match the attribute values of the tuple (of the "first" relation) that is being joined to. This is often useful, but it is also often quite annoying.
Often, we want to retain all tuples from one of the join arguments, regardless of whether any matching tuples exist in the other argument. This is where LEFTJOIN is useful. The prescriptions of the relational model require, however, that the result contains a value for each attribute of that other argument. In
order to be able to comply to that rule, a third argument must be provided which lists all the attribute values that are to be included in the result for tuples of the first argument, for which no matching tuples could be found in the second. LEFTJOIN(R1,R2,attrvalues) is thus semantically
equivalent to UNION(JOIN(R1,R2),EXTEND(SEMIMINUS(R1,R2),attrvalues)) :
Left join :
ARG1 |
ARG2 |
expressions |
LEFTJOIN |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
CATEGORY |
PAGESIZE |
big |
32768 |
|
category(string(unknown)) |
CATEGORY |
FILENAME |
PAGESIZE |
unknown |
BW_POSSIBLESPECIES.SPDB |
16384 |
big |
DATABASECATALOG.SPDB |
32768 |
unknown |
BW_SPOTS.SPDB |
16384 |
big |
BW_SPECIES.SPDB |
32768 |
unknown |
BW_SPOTSXSPNAME.SPDB |
49152 |
|
-
Symmetric difference
Symmetric difference is sometimes considered in the relational algebra, seeing as it has a corresponding counterpart in set algebra. The operator is defined to be semantically equivalent to (RA MINUS RB) UNION (RB MINUS RA). Note that, just as in set theory, this is also equivalent to (RA UNION RB) MINUS
(RA INTERSECT RB).
Yet iow, it means symmetric difference will yield a relation that holds all the tuples that appear in exactly one of the two arguments :
Symmetric difference :
ARG1 |
ARG2 |
XMINUS |
RELVARNAME |
CONSTRAINTMESSAGETEXT |
INDEXATTRIBUTE |
|
RELVARNAME |
DATABASECONSTRAINTCHECK |
SYSTEMDEFINEDCONSTRAINT |
ASSIGNMENTCONSTRAINT |
ASSIGNMENTCONSTRAINTCHECK |
CONSTRAINTMESSAGETEXT |
ATTRIBUTELENGTHCONSTRAINT |
DATABASECONSTRAINT |
CONSTRAINTINVOLVESRELVAR |
TUPLECONSTRAINT |
|
RELVARNAME |
DATABASECONSTRAINTCHECK |
SYSTEMDEFINEDCONSTRAINT |
ASSIGNMENTCONSTRAINT |
ASSIGNMENTCONSTRAINTCHECK |
ATTRIBUTELENGTHCONSTRAINT |
DATABASECONSTRAINT |
CONSTRAINTINVOLVESRELVAR |
TUPLECONSTRAINT |
INDEXATTRIBUTE |
|
-
Transitive Closure
Have you ever needed to process a bill-of-materials structure, trying to find out, e.g. all the parts in which a certain given part is used, either directly or indirectly ? Then you probably know the problems you get confronted with trying to solve this with SQL. There's no problem finding out in which
parts y a certain given part x is used directly. There's no problem finding out all the parts z in which any part y is used in which the given part x is used. There's no problem finding out all the parts x in which any part z is used in which any part y is used in which the given part x is used. There's no
problem computing the union of all of these. The problem is where to stop.
In the early days, you either had to choose some upper limit, but then if the real number of "levels of containment" in the database exceeded that limit, your query produced incorrect results. Or else you had to code an iteration process yourself until you no longer found any part in which any part was
used in which any part was used in which any part was used in which part x was used ...
Later, SQL came up with the possibility of writing recursive queries, relieving the programmers of having to write the iteration process themselves. But I think most of you who have already written recursive queries will agree that writing these is still rather unintuïtive and error-prone. At least,
that's my personal opinion.
The operator of the RA that is "underneath" all of this, is called transitive closure. Transitive closures are typically applied to relations that represent the edges of a graph, that is, that represents the "direct" parent-child relationships between the nodes of such a graph. The essence of transitive
closure is that if its resulting relation shows the existence of a path from any node1 to any node2, as well as the existence of a path from that same node2 to any node3, the it should also show the existence of a path from node1 to node3.
Here is an example from the SIRA_PRISE catalog. The catalog holds definitions for "virtual relvars", which are essentially nothing but an assigned shorthand name for some other expression that may be defined in terms of other relvars. So there is a "definitional dependency" between such a virtual relvar
and the relvars that appear in its defining expression. However, those "relvars that appear in its defining expression", may themselves be virtual relvars too, that depend in turn upon other relvars defined in the catalog. And so on and so on and so on ...
The following example shows a relation that reflects the "direct" dependencies between virtual relvars and the relvars in terms of which they are defined, and its transitive closure.
Transitive closure of virtual relvar dependencies :
ARG |
TCLOSESPEC |
TRANSITIVE CLOSURE |
REFERENCEDRELVARNAME |
RELVARNAME |
POSSREPCOMPONENT |
TYPEDIRECTLYREFERENCEDBY |
UDTPHYSICALPOSSREPCOMPONENT |
TYPEDIRECTLYREFERENCEDBY |
INTERVALTYPE |
TYPESUPERTYPES |
DATAACTIONREFERENCES |
RELVARCLUSTER |
DATABASECONSTRAINTCHECK |
RELVARCLUSTER |
JAVABACKEDTYPE |
TYPESUPERTYPES |
ASSIGNMENTCONSTRAINTCHECK |
RELVARCLUSTER |
TRIGGEREDDATAACTION |
RELVARCLUSTER |
INTERVALTYPE |
TYPEDIRECTLYREFERENCEDBY |
USERDEFINEDTYPE |
TYPESUPERTYPES |
KEYATTRIBUTE |
RELVARKEYDEFS |
VIRTUALRELVARREFERENCES |
VIRTUALRELVARDEPENDENCYGRAPH |
TYPEDIRECTLYREFERENCEDBY |
TYPEDEPENDENCYGRAPH |
KEY |
RELVARKEYDEFS |
CONSTRAINEDTYPE |
TYPESUPERTYPES |
CONSTRAINTINVOLVESRELVAR |
RELVARCLUSTER |
CONSTRAINEDTYPE |
TYPEDIRECTLYREFERENCEDBY |
VIRTUALRELVARREFERENCES |
RELVARCLUSTER |
|
(RELVARNAME,REFERENCEDRELVARNAME) |
REFERENCEDRELVARNAME |
RELVARNAME |
POSSREPCOMPONENT |
TYPEDIRECTLYREFERENCEDBY |
UDTPHYSICALPOSSREPCOMPONENT |
TYPEDIRECTLYREFERENCEDBY |
INTERVALTYPE |
TYPESUPERTYPES |
DATAACTIONREFERENCES |
RELVARCLUSTER |
INTERVALTYPE |
TYPEDEPENDENCYGRAPH |
DATABASECONSTRAINTCHECK |
RELVARCLUSTER |
JAVABACKEDTYPE |
TYPESUPERTYPES |
ASSIGNMENTCONSTRAINTCHECK |
RELVARCLUSTER |
TRIGGEREDDATAACTION |
RELVARCLUSTER |
INTERVALTYPE |
TYPEDIRECTLYREFERENCEDBY |
USERDEFINEDTYPE |
TYPESUPERTYPES |
KEYATTRIBUTE |
RELVARKEYDEFS |
VIRTUALRELVARREFERENCES |
VIRTUALRELVARDEPENDENCYGRAPH |
UDTPHYSICALPOSSREPCOMPONENT |
TYPEDEPENDENCYGRAPH |
POSSREPCOMPONENT |
TYPEDEPENDENCYGRAPH |
TYPEDIRECTLYREFERENCEDBY |
TYPEDEPENDENCYGRAPH |
KEY |
RELVARKEYDEFS |
CONSTRAINEDTYPE |
TYPEDEPENDENCYGRAPH |
CONSTRAINEDTYPE |
TYPESUPERTYPES |
CONSTRAINTINVOLVESRELVAR |
RELVARCLUSTER |
CONSTRAINEDTYPE |
TYPEDIRECTLYREFERENCEDBY |
VIRTUALRELVARREFERENCES |
RELVARCLUSTER |
|
-
Generalized Transitive Closure
While TCLOSE allows us to figure out which nodes of a graph are -indirectly- connected to which other nodes (if the graph is represented as a binary relation of immediately connected nodes), that information often isn't sufficient. Quite often, we also want to know certain "properties" of the path that
connects one node with another. One very popular case in point is the count of edges on such a path. GTCLOSE can provide us with such information. We will explain the operation of GTCLOSE by highlighting the differences with TCLOSE.
The <AttributeNamePair> list to be specified in GTCLOSE has exactly the same syntax and semantics as it has in TCLOSE : it determines which attributes of one tuple of a pair under consideration must be equal to which attributes of the other tuple of the pair. Contrary to TCLOSE, however, it is not
required that all attributes of the relational expression subjected to GTCLOSE, are mentioned in the <AttributeNamePair> List. (In fact it would kind of defeat the purpose of GTCLOSE, seeing as in that case no additional path information would be computed, and the expression would degenerate to just
regular TCLOSE.)
The <GTCloseDef> clauses must name all the attributes of the <RelationalExpression> that are not mentioned in any <AttributeNamePair>, plus the expression that will determine the value for attributes included in the result. In the example below, EDGECNT is such an attribute. The
<Expression> that defines the computation, can reference any attribute of any of the two tuples of the pair of tuples that have been matched for inclusion, by prefixing an attribute name with either
__L__
or
__R__
. In the example, we take the sum of the EDGECNT values that are present in the two tuples being matched. If two tuples from the initial relation are being matched, then their EDGECNT values will be 1, resulting in an EDGECNT value of 2. If this new tuple is then subsequently matched with some other tuple of
the initial relation, the two EDGECNT values summed will be 2 and 1, resulting in an EDGECNT value of 3 in then new tuple. (and so on and son on.)
Quite often, we will want to produce these kind of closures starting from a relation that just holds identifications of connected nodes, but no attributes (and corresponding values) for the type of information we want to obtain about the paths in the graph that the closure finds between connected nodes.
So it will be a typical pattern that we will have to first extend that "binary" relation with "seed values" for the extra attributes at hand. In the example, we set the "seed value" for the EDGECNT attribute to 1, reflecting the fact that immediately connected nodes have an edge count between them of 1.
The EDGECNT example is one where "directionality" doesn't matter, but this isn't always the case. If we wanted to produce, say, a comma-separated list of all the intermediate nodes between two indirectly connected nodes, we'd have to do the following :
- "seed" the initial tuples with a PATH attribute value of the zero-length string (STRING())
- in the computation formula for the PATH attribute, specify the concatenation of __L__PATH, the comma character, the connecting node name (__L__REFERENCEDRELVARNAME or __R__RELVARNAME), comma character, __R__PATH
This example illustrates that the importance of choosing the right prefix for the involved attribute, as well as the fact that the computations do not always exclusively depend on the matching tuples' attribute values for the attribute being computed. The example code and resulting value are below (ARG
value not being shown):
gtclose(extend(virtualrelvarreferences,(edgecnt(int(1))path(string()))), (relvarname,referencedrelvarname), (edgecnt(plus(__L__edgecnt,__R__edgecnt))path(concat(__L__path,concat(string(,),concat(the_string(__R__relvarname),concat(string(,),__R__path)))))))
REFERENCEDRELVARNAME |
RELVARNAME |
PATH |
EDGECNT |
TRIGGEREDDATAACTION |
RELVARCLUSTER |
|
1 |
UDTPHYSICALPOSSREPCOMPONENT |
TYPEDEPENDENCYGRAPH |
,TYPEDIRECTLYREFERENCEDBY, |
2 |
DATAACTIONREFERENCES |
RELVARCLUSTER |
|
1 |
VIRTUALRELVARREFERENCES |
VIRTUALRELVARDEPENDENCYGRAPH |
|
1 |
INTERVALTYPE |
TYPEDIRECTLYREFERENCEDBY |
|
1 |
UDTPHYSICALPOSSREPCOMPONENT |
TYPEDIRECTLYREFERENCEDBY |
|
1 |
CONSTRAINEDTYPE |
TYPESUPERTYPES |
|
1 |
KEYATTRIBUTE |
RELVARKEYDEFS |
|
1 |
USERDEFINEDTYPE |
TYPESUPERTYPES |
|
1 |
POSSREPCOMPONENT |
TYPEDEPENDENCYGRAPH |
,TYPEDIRECTLYREFERENCEDBY, |
2 |
TYPEDIRECTLYREFERENCEDBY |
TYPEDEPENDENCYGRAPH |
|
1 |
CONSTRAINEDTYPE |
TYPEDEPENDENCYGRAPH |
,TYPEDIRECTLYREFERENCEDBY, |
2 |
KEY |
RELVARKEYDEFS |
|
1 |
INTERVALTYPE |
TYPESUPERTYPES |
|
1 |
POSSREPCOMPONENT |
TYPEDIRECTLYREFERENCEDBY |
|
1 |
JAVABACKEDTYPE |
TYPESUPERTYPES |
|
1 |
CONSTRAINTINVOLVESRELVAR |
RELVARCLUSTER |
|
1 |
VIRTUALRELVARREFERENCES |
RELVARCLUSTER |
|
1 |
CONSTRAINEDTYPE |
TYPEDIRECTLYREFERENCEDBY |
|
1 |
ASSIGNMENTCONSTRAINTCHECK |
RELVARCLUSTER |
|
1 |
DATABASECONSTRAINTCHECK |
RELVARCLUSTER |
|
1 |
INTERVALTYPE |
TYPEDEPENDENCYGRAPH |
,TYPEDIRECTLYREFERENCEDBY, |
2 |
It may be somewhat unintuitive why the __R__relvarname had to be used in order to obtain the correct name for the connecting node between "matched edges". As a visual aid, here is a representation of how "__L__" tuples are matched with "__R__" tuples :
relvarname in produce tuple |
matching attributes |
referencedrelvarname in produced tuple |
__L__relvarname |
__L__referencedrelvarname |
|
|
__R__relvarname |
__R__referencedrelvarname |
-
Grouping
Time to revert to the bit of theory from the introductory chapters. In section 2.5.2, it was already established that a Generalized product, and thus a relation too, can be considered as a type in its own right. There is an important consequence : we can use that type to define a relation heading and
declare a certain attribute in that heading to be of this type ! The value that such an attribute will take on, will be a tuple value specifically (remember that a generalized product/relation is a set of tuples, so if the relation is the type, the values drawn from it must be tuples).
However, this theoretical possibility is actually not all that interesting. There is little (actually nothing) that one can represent using a single tuple that one cannot represent using the needed number of separate attributes. It is far more interesting to have the possibility to have attributes inside
a tuple that take on a relation value. Indeed, there is one thing that a relation value can represent and that other representations cannot handle, and that is the empty relation (or empty set).
But if we want that possibility, what type do we draw those values from ? We haven't encountered any kind of set yet that had relations as its members. We do need such a set if we want relations to be drawn as values from a type. That set is not so difficult to define : given a generalized product, we can
"construct" the set of all possible subsets of that generalized product (In mathematics, this is known as the powerset of a set). Remember that since we named such a subset a relation, we have actually defined a set of all possible relations of a certain heading.
Relations are, themselves, values too, and relation attributes can therefore be declared to be of another relation type. Such attributes are called 'Relation-valued attributes' (RVA's for short).
In RA, two operators can be defined that allow us to manipulate relations in connection with RVA's : GROUP and UNGROUP. The former essentially serves the purpose of "creating" relations that have a relation-valued attribute, and UNGROUP essentially does the opposite.
GROUP takes as its first argument a relation value, and as its second argument a "grouping specification". This "grouping specification" consists of an attribute name for the relation that the GROUP operator will produce, and a set of attribute names that appear in the input relation.
GROUP will produce a relation of degree n+m, where n is the number of attributes of the input that have not been mentioned in any grouping specification, and m is the number of grouping specifications. Observe that n can be zero. The n attributes that have not been mentioned in any grouping specification
will also appear in the heading of the relation produced by GROUP. The other m attributes in the result of GROUP will have the name specified in the grouping specification, and be relation-typed (i.e. those will be RVA's in particular). For each of these m relation-typed attributes, the heading of their
relation value will consist of all and exactly those attributes from the (heading of the) input relation that were specified in the grouping specification for that relation-typed attribute.
Below is an example of using the GROUP operator to list relvar keys and the attributes that make them up.
Grouping that relation with this grouping specification, will yield the relation :
Grouping attributenames :
ARG |
GROUPSPEC |
GROUP |
ERRORCODE |
ATTRIBUTENAME |
245 |
RELVARNAME |
258 |
STORAGESPACEID |
244 |
RELVARNAME |
274 |
RECORDTYPENAME |
274 |
ORDINAL |
242 |
TYPENAME |
256 |
ATTRIBUTENAME |
245 |
RECORDTYPENAME |
244 |
RECORDTYPENAME |
250 |
RELVARNAME |
254 |
RELVARID |
242 |
COMPONENTNAME |
246 |
ERRORCODE |
248 |
ATTRIBUTENAME |
268 |
ERRORCODE |
258 |
FILENAME |
245 |
ORDINAL |
248 |
RELVARNAME |
256 |
RELVARNAME |
244 |
ATTRIBUTENAME |
255 |
RELVARNAME |
274 |
RELVARNAME |
|
attributenames(attributename) |
ERRORCODE |
ATTRIBUTENAMES |
245 |
ATTRIBUTENAME |
RECORDTYPENAME |
ORDINAL |
RELVARNAME |
|
274 |
ATTRIBUTENAME |
RECORDTYPENAME |
ORDINAL |
RELVARNAME |
|
255 |
|
248 |
ATTRIBUTENAME |
ATTRIBUTENAME |
RELVARNAME |
|
242 |
ATTRIBUTENAME |
COMPONENTNAME |
TYPENAME |
|
256 |
ATTRIBUTENAME |
ATTRIBUTENAME |
RELVARNAME |
|
268 |
|
244 |
ATTRIBUTENAME |
ATTRIBUTENAME |
RECORDTYPENAME |
RELVARNAME |
|
246 |
|
254 |
|
258 |
ATTRIBUTENAME |
FILENAME |
STORAGESPACEID |
|
250 |
|
|
RVA nesting can be done to more than one level :
ARG |
GROUPSPEC |
GROUP |
ERRORCODE |
RELVARNAME |
ATTRIBUTENAMES |
361 |
CLUSTEREDRECORDSPACELOCATTRS |
ATTRIBUTENAME |
FILENAME |
ATTRIBUTENAME |
STORAGESPACEID |
|
360 |
CLUSTEREDRECORDSPACELOCATTRS |
ATTRIBUTENAME |
FILENAME |
STORAGESPACEID |
ORDINAL |
|
|
keys(errorcode,attributenames) |
KEYS |
RELVARNAME |
ERRORCODE |
ATTRIBUTENAMES |
360 |
ATTRIBUTENAME |
FILENAME |
STORAGESPACEID |
ORDINAL |
|
361 |
ATTRIBUTENAME |
FILENAME |
ATTRIBUTENAME |
STORAGESPACEID |
|
|
CLUSTEREDRECORDSPACELOCATTRS |
|
Multiple groupings can be specified, creating multiple RVA attributes in one go :
ARG |
GROUPSPEC |
GROUP |
RECORDTYPENAME |
STSPID |
FILENAME |
MAXL |
RELVARNM |
R0_ACC |
51 |
DATABASECATALOG.SPDB |
320 |
ACC |
R1_ACC |
10 |
DATABASECATALOG.SPDB |
320 |
ACC |
|
storagespaces( filename,stspid ) recordtypes( recordtypename,maxl) |
STORAGESPACES |
RELVARNM |
RECORDTYPES |
STSPID |
FILENAME |
51 |
DATABASECATALOG.SPDB |
10 |
DATABASECATALOG.SPDB |
|
ACC |
RECORDTYPENAME |
MAXL |
R0_ACC |
320 |
R1_ACC |
320 |
|
|
Observe from this last example that by doing this "multiple grouping", you have actually "lost information", in that you have lost the "connection" between, e.g., recordtype R0_ and STSPID 51, and it will be impossible to reconstitute the original relation from the grouped one using UNGROUP.
As a final note, you might suspect a similarity here with SQL's GROUP BY ... construct here. That is not quite correct. The SQL construct corresponds more to the SUMMARIZE operator, which will be explained later. "Common" SQL actually does not have a counterpart to the GROUP operator, because it has no
way to expose a table as an attribute value to the user (The SQL standard does have ways to achieve that, but it is seriously clunky, certainly not all SQL implementations will support it, and even with an SQL implementation that supports it, the feature is used only rarely, if at all). As far as the link to
SQL is important, the GROUP BY clause creates a table with _EXACTLY THE SAME ROW COUNT AS THE INPUT_, with just a few "invisible ties" between the rows that are "grouped together". The GROUP opeator described here does something quite different, affecting the cardinality of the result quite seriously.
An important point to make about the GROUP operator is that it can provide a substitute for the DIVIDEBY operator discussed earlier. Setting up an expression using the GROUP operator that produces the same result as some invocation of DIVIDEBY is still somewhat tricky and elaborate, but it is certainly
doable.
-
Ungrouping
It will now also be intuitively clear what the purpose is of UNGROUP : that is precisely to "revert" relations such as those produced by GROUP, to a relation that does not have any (or certainly less) relation-typed attributes.
There are certain relation values that GROUP cannot produce, but for which UNGROUP nevertheless has well-defined results. In particular, GROUP can never produce a relation value that has a relation-valued attribute for which the empty relation appears as a value in the result of the GROUP invocation.
UNGROUP must, however, be able to deal with such relations. In particular, it will do so by not including a corresponding tuple in its result.
An Ungrouping over "operandsignature" - note how the empty RVA makes "PI" disappear from the result :
ARG |
UNGROUP |
OPINVNM |
RETURN |
OPERANDSIGNATURE |
PI |
FLOAT |
|
PLUS |
AVERAGE |
TYPENAME |
ORDINAL |
AVERAGE |
2 |
AVERAGE |
1 |
|
PLUS |
SCALEDNUMBER |
TYPENAME |
ORDINAL |
SCALEDNUMBER |
1 |
SCALEDNUMBER |
2 |
|
PRIORLONG |
LONG |
|
PRIORDATE |
DATE |
|
PLUS |
TIMEOFDAY |
TYPENAME |
ORDINAL |
TIMEOFDAY |
1 |
TIMEOFDAY |
2 |
|
PLUS |
INT |
TYPENAME |
ORDINAL |
INT |
2 |
INT |
1 |
|
PRIORINT |
INT |
|
PLUS |
FLOAT |
TYPENAME |
ORDINAL |
FLOAT |
1 |
FLOAT |
2 |
|
PLUS |
LONG |
TYPENAME |
ORDINAL |
LONG |
2 |
LONG |
1 |
|
PLUS |
BIGINT |
TYPENAME |
ORDINAL |
BIGINT |
1 |
BIGINT |
2 |
|
|
TYPENAME |
ORDINAL |
OPINVNM |
RETURN |
INT |
1 |
PRIORINT |
INT |
AVERAGE |
1 |
PLUS |
AVERAGE |
AVERAGE |
2 |
PLUS |
AVERAGE |
SCALEDNUMBER |
1 |
PLUS |
SCALEDNUMBER |
TIMEOFDAY |
1 |
PLUS |
TIMEOFDAY |
INT |
1 |
PLUS |
INT |
SCALEDNUMBER |
2 |
PLUS |
SCALEDNUMBER |
LONG |
1 |
PRIORLONG |
LONG |
TIMEOFDAY |
2 |
PLUS |
TIMEOFDAY |
INT |
2 |
PLUS |
INT |
FLOAT |
2 |
PLUS |
FLOAT |
FLOAT |
1 |
PLUS |
FLOAT |
LONG |
1 |
PLUS |
LONG |
LONG |
2 |
PLUS |
LONG |
DATE |
1 |
PRIORDATE |
DATE |
BIGINT |
1 |
PLUS |
BIGINT |
BIGINT |
2 |
PLUS |
BIGINT |
|
-
Aggregation
The aggregation operator is not commonly defined as a separate operator of the RA. It is specific to SIRA_PRISE, and the reason is that, despite the fact that it is basically just a building block on which the summarization operator builds, it may nevertheless in certain specific cases be useful to expose
aggregation as a separately existing operator to the user. Summarization also becomes easier to explain once aggregation has been defined.
The aggregation operator takes one relation as an input argument, and a set of aggregation specifications as its second argument. Each aggregation specification consists of an attributename for the relation produced by the operator, the name of an associative operator that will produce the aggregated
attribute value in the resulting relation, and an expression defining the value to be aggregated.
The aggregation operator produces its result by proceeding as follows :
First, for each aggregation spec an intermediate result is initialized to the identity value of the associative operator named in the aggregation spec.
Then, the entire input relation is processed on a tuple-per-tuple basis, and for each tuple :
The value of each expression in an aggregation spec is evaluated for that tuple, and that value (v) is then "aggregated" into the intermediate result by invoking the associative operator named in the aggregation spec with as arguments both the intermediate result and the obtained value v.
When all tuples of the input relation are processed, a singleton relation is produced with one attribute per aggregation spec, where the value for each attribute is the last "intermediate" result of that aggregation spec.
E.g. an aggregation spec COUNT(PLUS(INT(1))), will evaluate the literal value 1 for each tuple, invoke the PLUS operator for each tuple, and produce a tuple holding an attribute COUNT whose value is equal to the last "intermediate result" obtained (which will thus be equal to the number of tuples in the
input relation).
Likewise, an aggregation spec COUNT(PLUS(ATTR)), will evaluate the attribute ATTR for each tuple and sum those.
Aggregation, showing the evaluation of a literal for each tuple, an attribute of each tuple, and an expression involving attributes of each tuple :
ARG |
AGGREGATESPEC |
AGGREGATE |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
silly(plus(pagesize)) silly2(mult(tofloat(pagesize))) count(plus(int(1))) |
SILLY2 |
COUNT |
SILLY |
1.4167099448608936E22 |
5 |
147456 |
|
The associative operator in an aggregation can also be one of the interval operators. In the case of IINTERSECT, this can give rise to the empty interval appearing as a value in a relation (this is the only way this can happen in SIRA_PRISE) :
Aggregation - a very very special case :
ARG |
AGGREGATESPEC |
AGGREGATE |
PS |
BEGIN(16384)END(16385) |
BEGIN(49152)END(49153) |
BEGIN(32768)END(32769) |
|
iintersect(iintersect(ps)) |
IINTERSECT |
BEGIN(16384)END(16384) |
|
Observe that AGGREGATE requires its operator to be both associative and commutative on the corresponding expression type. But then how about computing averages, which is perhaps the most common form of aggregation ? To do averaging, SIRA_PRISE requires the use of the PLUS operator, which operates, not on
any number type, but on the special-purpose type AVERAGE :
AGGREGATE(DBMSFILE,(AVGPAGESIZE(PLUS(AVERAGE(COUNT(1)VALUE(TOFLOAT(PAGESIZE)))))))
Using AGGREGATE to compute averages :
ARG |
AGGREGATESPEC |
AGGREGATE |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
AVGPAGESIZE(PLUS(AVERAGE(COUNT(1)VALUE(TOFLOAT(PAGESIZE))))) |
AVGPAGESIZE |
COUNT(5)VALUE(29491.199999999997) |
|
Observe the following points :
- The operator name is PLUS
- The expression aggregated is a value selector of type AVERAGE
- That value selector selects values with two components, an observation count and an observation value
- The observation count component is of type INT
- The observation value is of type FLOAT
- Getting the 'FLOAT' value of the PAGESIZE attribute, involves invoking the TOFLOAT "conversion" operator
Cumbersome, but it works, and at least it is explicit about the observation count, also in the following case :
AGGREGATE(TABLE_DUM,(AVG(PLUS(AVERAGE(COUNT(1)VALUE(FLOAT(1)))))))
Another example which finds the length of the longest scalar type :
AGGREGATE(DBMSFILE,(LONGESTLENGTH(MAX(MAXIMUMSIZE))))
Using aggregation to find the highest appearing value :
ARG |
AGGREGATESPEC |
AGGREGATE |
FILENAME |
PAGESIZE |
DATABASECATALOG.SPDB |
32768 |
BW_SPECIES.SPDB |
32768 |
BW_SPOTSXSPNAME.SPDB |
49152 |
BW_POSSIBLESPECIES.SPDB |
16384 |
BW_SPOTS.SPDB |
16384 |
|
LONGESTPAGESIZE(MAX(PAGESIZE)) |
|
Aggregation on empty inputs yields the aggregation operator's identity element, if that exists :
Aggregation on empty inputs :
ARG |
AGGREGATESPEC |
AGGREGATE |
|
longestpagesize(max(pagesize)) shortestpagesize(min(pagesize)) totalpagesize(plus(pagesize)) |
TOTALPAGESIZE |
LONGESTPAGESIZE |
SHORTESTPAGESIZE |
0 |
-2147483648 |
2147483647 |
|
-
Summarization
Summarization is basically the operator underlying SQL's GROUP BY construct. In terms of the RA, it can be defined as a combination of invocations of GROUP and AGGREGATE.
Summarization takes three arguments : a relation R to be summarized, a set of attributenames A defining which "groups of tuples" of R will be summarized together, and a set of summarization specs S which are completely identical in nature to the aggregation specs presented in the foregoing section.
Summarization essentially obtains its result by performing the following steps :
First, GROUP R such that all attributes mentioned in A are retained, and all the other attributes of R become a member of a single relation-valued attribute RVA.
Then, process the result of that grouping on a tuple-per-tuple basis, where for each tuple, the relation value of its RVA attribute is aggregated using the given summarization / aggregation specs, and a resulting tuple is produced with all attributes in A, plus the RVA attribute, whose value will be the
singleton relation obtained from the aggregation.
Then, that relation is UNGROUPED back again. The result will thus be a relation whose heading has all attributes mentioned in A, plus one attribute per summarization spec. The values for these latter attributes will be values that have been computed by the aggregation mechanism.
As an example, we show how to obtain summary information about the system's DBMSFILEs and the STORAGESPACEs they hold :
Using SUMMARIZEBY to obtain storagespace counts and storagespace pagecounts per dbms file:
ARG |
SUMMBYSPEC |
AGGREGATESPEC |
SUMMARIZEBY |
STORAGESPACEID |
PAGECOUNT |
FILENAME |
EXTENTSCOUNT |
145 |
57 |
DATABASECATALOG.SPDB |
30 |
22 |
14 |
DATABASECATALOG.SPDB |
30 |
1 |
2850 |
BW_SPECIES.SPDB |
10 |
15 |
24 |
DATABASECATALOG.SPDB |
30 |
21 |
610 |
DATABASECATALOG.SPDB |
30 |
3 |
605 |
DATABASECATALOG.SPDB |
30 |
10 |
191 |
DATABASECATALOG.SPDB |
30 |
52 |
43 |
DATABASECATALOG.SPDB |
30 |
3 |
650 |
BW_SPOTSXSPNAME.SPDB |
10 |
1 |
50 |
BW_POSSIBLESPECIES.SPDB |
10 |
1 |
2250 |
BW_SPOTS.SPDB |
10 |
17 |
30 |
DATABASECATALOG.SPDB |
30 |
117 |
8 |
DATABASECATALOG.SPDB |
30 |
2 |
23 |
DATABASECATALOG.SPDB |
30 |
65 |
150 |
DATABASECATALOG.SPDB |
30 |
51 |
412 |
DATABASECATALOG.SPDB |
30 |
54 |
14 |
DATABASECATALOG.SPDB |
30 |
|
(filename) |
storagespacecount(plus(int(1))) pagecount(plus(pagecount)) |
PAGECOUNT |
FILENAME |
STORAGESPACECOUNT |
2850 |
BW_SPECIES.SPDB |
1 |
2181 |
DATABASECATALOG.SPDB |
13 |
50 |
BW_POSSIBLESPECIES.SPDB |
1 |
650 |
BW_SPOTSXSPNAME.SPDB |
1 |
2250 |
BW_SPOTS.SPDB |
1 |
|
-
A remark on associativity
The relational operators union, intersect, symmetric difference and join have been defined here in such a way that they accept exactly two arguments. However, it happens to be the case that these operators have the mathematical property of being associative. SIRA_PRISE therefore allows invocations of these
operators to have more than two arguments.
The theoretical possibility of invocations of these operators with only one argument, or with no arguments at all, is syntactically not supported by SIRA_PRISE. If these operators are invoked in the context of an aggregation or summary, then SIRA_PRISE will compute the results correctly.
The RA and interval types
-
Rationale
It's quite likely that you have already run into the need for your databases to keep track of historical data. It's an old problem, and existing technology does not really offer a comfortable or easy way for programmers to deal with it. Note very carefully that I haven't said "the Relational Model does not
offer ...". That is due to the fact that the relational model does offer an adequate solution for the problem, it's just that the currently existing implementations of the model don't offer an adequate solution.
If you have had to design a database that kept track of historical data, then your solution probably was to "add a startdate to the table, and include that in the key". You may or may not have included an enddate in the table too, and depending on whether or not you did, found yourself confronted to either
one of the following two problems for which no easyly written and easily maintainable solution exists :
If you do include an enddate, how will you prevent some row in your table from "overlapping" with another row (and possibly even contradict with that other row).
If you do not include an enddate, what is the query that will give you the applicable enddate for a certain row, if requirements are such that both the start date and the end date are needed ?
In general, you will probably have found that writing queries becomes a very tedious and error-prone process, right up to the point where writing correct queries actually becomes a real nightmare.
There does, however, exist a much better solution to the problem. That solution is explained thoroughly in the TDATRM and TaRT books. These books are strongly recommended reading for any programmer who has to deal with historical data. The solution is based on the usage of what is called "interval types".
SIRA_PRISE implements a solution based on interval types that is inspired by said book. There does exist a significant difference in approach between the book and SIRA_PRISE, but that difference does not show up in the results that the relational operators produce. So, as far as the relational operators
(as implemented in SIRA_PRISE) themselves are concerned, I believe they behave identically to the behaviour depicted in the book. Which is what matters.
-
The interval type
The word 'interval' may vaguely ring a bell. You may remember having been taught about them in your math classes, and maybe wondered what good it all was. Well, the 'interval type' used here is exactly the same kind of thing.
Consider two attributes in a relvar called 'STARTDATE' and 'ENDDATE' respectively, both declared to be of type DATE. The rest of the attributes contain the "actual" information, e.g. :
NAME |
DISCIPLINE |
STARTDATE |
ENDDATE |
Bob Beamon |
Long Jump |
1968-07-24 |
1998-06-18 |
You can probably guess what this row intends to declare : Bob Beamon held the world record Long Jump from 1968-07-24 to 1998-06-18. Now, what "meaning" can be ascribed to these dates, what "things" do these dates tell us ? Two are rather obvious : one being that way precisely on july, 24 of 1968, Bob
Beamon acquired the world record, and another one being that precisely on june, 18 of 1998, he lost the world record to someone else who did better. But there's a third thing they tell us, and that's about all the dates that are "between" these two dates : we can say Bob Beamon was the world record holder on all
these dates in between, too.
In that sense, the two dates are somehow "closely related", or it might even be said that in that sense, they actually "belong together". Expressing that "belonging together" is precisely what is done by representing these dates as an interval :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Note that this table now has only three columns, representing the fact that there now are only three attributes in the relation. Note also that this 'DURING' attribute is not an attribute of type DATE (indeed it does not contain a valid date), but rather of type 'INTERVAL of DATE'.
Informally, one can think of this interval as "representing a set of dates", where this set has the particular property that there does not exist a date that is "in between" the start and end dates, and where that date is not a member of the set. In other words, the set represented by this interval is
"contiguous", meaning that all the dates it contains form a contiguous series.
One might think that, by replacing the two DATE attributes with one single INTERVAL value, we have lost the possibility to obtain the actual date when Bob Beamon acquired his world record. Indeed, there no longer is an actual date value in the relation. Obviously, this is undesirable, and this must be
catered for. That is done by defining a set of operators to deal with values of an interval type which allow us to manipulate these interval values. The most important of them are : STARTPOINTOF (synonymous to THE_BEGIN as well as THE_FROM), ENDPOINTOF (synonymous to THE_END, and disregarding the issue of
closed/closed vs. closed/open interpretation of the bounds, also to THE_TO), IUNION, IINTERSECT, and IMINUS. Another important class of operators in connection with interval types are similar to those known as "Allen's operators", after the man who first proposed them. They allow to verify whether two interval
values are adjacent, overlapping, disjoint, contained within one another, etc. etc.
STARTPOINTOF will "retrieve" the start date from an interval value, ENDPOINTOF will "retrieve" the end date from such an interval value. The other three are the obvious equivalents of the like-named operators of the set algebra (this makes sense since an interval can indeed be thought of as being a set).
Obviously, the interval versions of these operators must ensure that the value they produce is indeed still an interval (e.g. the set union of two "contiguous" sets is not necessarily itself a "contiguous" set).
-
Packed form of a relation
The information contained in the 'world-records' example can be represented in a multitude of ways, of which only the simplest was given. Another relation representing exactly the same information is, e.g., the following :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1995-03-10 |
Bob Beamon |
Long Jump |
1992-01-12 - 1998-06-18 |
Yet another one might be, e.g. :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1995-03-10 |
Bob Beamon |
Long Jump |
1995-03-10 - 1998-06-18 |
In the first example, the two tuples both ascribe the world record for the same discipline to the same person, for two different periods, but those periods overlap. In the second example, once again the world record for the same discipline is ascribed to the same person for two different periods, and
allthough they no longer overlap, it is still the case that the two periods together form one single, contiguous, span of time.
In the first example, that causes a form of reduncancy to appear in the database : for the period from 1992-01-12 to 1995-03-10, there are two distinct tuples each saying who held the world record in that particular period. If someone later discovered that this information is actually incorrect, then two
distinct tuples would have to be corrected. All such forms of reduncancy are certainly undesirable. The second example does not suffer from this kind of redundancy, but nevertheless it is still a "more complicated" way of saying the very same thing compared to what is also said by the single-tuple example at the
beginning of this section. More complexity without additional value, is also undesirable.
So it is therefore, in general, desirable that the interval values in the database are "as large as possible", without losing or corrupting any information. If a relation heading has at most one interval-typed attribute, then there is exactly one relation that satisfies that criterion. That relation is
said to be in "packed form". Relations in packed form have the desirable properties that there are no points within the intervals for which the same logical proposition is implied by two distinct tuples, and that there are no two distinct tuples that can be "merged into one" whilst maintaining the same
"information set".
Since these properties are desirable, it is also desirable that the relational operators always produce relations that are in packed form. That is why, by default, the operators discussed in the foregoing section will ALWAYS produce relations that are in packed form. The following sections will :
- discuss the impact that the presence of interval-typed attributes has on the operation and/or results of the relational operators defined in the foregoing section,
- discuss all the additional operators supported by SIRA_PRISE for dealing with relations holding interval-typed attributes specifically. These are PACK, UNPACK, UNIONUSING, MINUSUSING, INTERSECTUSING,
JOINUSING, and a set of "derivatives" of these including XMINUSUSING, SEMIJOINUSING, SEMIMINUSUSING and LEFTJOINUSING,
- discuss whether the "always produces a relation in packed form" is implemented by the operator doing its own explicit invocation of PACK, or just by relying on the input being packed, knowing that its own operation cannot affect pack state of the relation. (If the latter, feeding the operator with a
relation that was explicitly UNPACKed, will not produce a packed relation, somewhat contrary to the "results always packed " claim.)
-
The relational operators and Interval types
-
PACK
Most of the time, it is desirable for relations to be in packed form. SIRA_PRISE facilitates that by always packing the results of a relational operator invocation, with one exception, which is when the obvious purpose of the operator invocation is to explicitly obtain unpacked results (see UNPACK).
However, when more than one interval-typed attribute is part of the relation, multiple packed forms are possible, and SIRA_PRISE has to choose one "arbitrarily". SIRA_PRISE chooses the one which has the list of PACK attributes in lexicographical order, but sometimes it might be needed to obtain one of the
other packed forms. For that purpose, the PACK operator is explicitly exposed.
The PACK operator takes a relational expression for argument, plus an ordered list of attributes in that relation (all of which must be interval-typed in the input relation). Call that list of interval-typed attributes the "pack list", and call the set of all other attributes the "nonpack" attributes
(note that there may be interval-typed attributes of the relation among them). Informally, the results of an invocation of the PACK operator satisfy the following properties :
- If a combination of point values, covered by the interval values for the attributes in the PACK list, appears in the input relation for a certain combination of attribute values for the attributes in the nonpack list, then exactly one tuple will be in the output relation such that its interval values
for the pack list cover the concerned point values, and the nonpack attribute values are the same as the input.
- The interval values for the pack attributes in the output relation will be as large as possible, with the goal of maximizing interval values for the first attribute in the pack list taking precedence over trying to maximize values for the second, etc. etc.
Some examples should clarify this a lot : relation(body(tuple(x(intinterval(from(4)to(7)))y(intinterval(from(1)to(4)))) tuple(x(intinterval(from(1)to(3)))y(intinterval(from(1)to(13))))))
PACKing was automaticallly done in lexicographical order (X,Y) :
X |
Y |
BEGIN(1)END(8) |
BEGIN(1)END(5) |
BEGIN(1)END(4) |
BEGIN(5)END(14) |
Packing that value on (Y,X) :
Query result :
ARG |
PACK_Y_X |
X |
Y |
BEGIN(1)END(4) |
BEGIN(5)END(14) |
BEGIN(1)END(8) |
BEGIN(1)END(5) |
|
X |
Y |
BEGIN(1)END(4) |
BEGIN(1)END(14) |
BEGIN(4)END(8) |
BEGIN(1)END(5) |
|
PACK does not guarantee that the result has the minimum possible cardinality :
PACK can possibly _increase_ the cardinality of the result, compared to the input :
ARG |
PACK_Y_X |
X |
Y |
BEGIN(1)END(10) |
BEGIN(1)END(10) |
BEGIN(7)END(24) |
BEGIN(8)END(34) |
|
X |
Y |
BEGIN(7)END(10) |
BEGIN(1)END(34) |
BEGIN(10)END(24) |
BEGIN(8)END(34) |
BEGIN(1)END(7) |
BEGIN(1)END(10) |
|
PACK is also useful for achieving the functionality of TDATRM's / TaRT's "USING equality". Instead of comparing two relations directly using the relation equality operator, compare their packed forms (packed on the same pack list, of course) :
The EQ column shows the result of EQ(ARG1,ARG2), the EQUSING columns shows the result of EQ(PACK(ARG1 , ...),PACK(ARG2 , ...)) :
ARG2 |
ARG1 |
EQ |
EQUSING |
X |
Y |
BEGIN(1)END(4) |
BEGIN(1)END(14) |
BEGIN(4)END(8) |
BEGIN(1)END(5) |
|
X |
Y |
BEGIN(1)END(4) |
BEGIN(5)END(14) |
BEGIN(1)END(8) |
BEGIN(1)END(5) |
|
False |
True |
-
UNPACK
The UNPACK operator from TDATRM is one one would typically want to avoid using as much as possible. Basically, what it does is include a tuple in the result for each and every combination of point values covered by the input relation (that is, covered by the interval values that appear in the relation
for the attributes mentioned in the unpack list), with single-point interval values for the attributes specified in an unpack list. However, a use case has been revealed where it can actually serve us well, and that is the case of "generating sequences". If for any valid reason, you need a relation with one
tuple for each number in a range (that range must be over an ordinal type otherwise we'd be generating infinite relations), then the UNPACK operator helps you do this at the snap of a finger.
Unpacking a single interval-typed attribute :
ARG |
UNPACK |
|
II |
BEGIN(7)END(8) |
BEGIN(8)END(9) |
BEGIN(3)END(4) |
BEGIN(5)END(6) |
BEGIN(1)END(2) |
BEGIN(6)END(7) |
BEGIN(2)END(3) |
BEGIN(4)END(5) |
|
Observe that the attribute type for II is still the INTERVAL type, and not the underlying point type. For obtaining the relation holding the series of values of the underlying point type, we still have to apply a transformation :
TRANSFORM(... , NUMBER(THE_FROM(II))) :
Query result :
ARG |
SERIES |
|
|
It should now be obvious why invocations of the UNPACK operator are exempted from the general rule that results from relational operator invocations are always automatically PACKed.
One final example to illustrate UNPACK's devastating combinatorial explosion (just because we couldn't resist ):
UNPACK can produce big piles of tuples :
ARG |
UNPACK_OVER |
UNPACK |
II |
LL |
DD |
BEGIN(1)END(4) |
BEGIN(8)END(12) |
BEGIN(2015-07-30)END(2015-08-02) |
|
ii,dd,ll |
II |
LL |
DD |
BEGIN(2)END(3) |
BEGIN(10)END(11) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(1)END(2) |
BEGIN(10)END(11) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(3)END(4) |
BEGIN(10)END(11) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(1)END(2) |
BEGIN(8)END(9) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(1)END(2) |
BEGIN(9)END(10) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(2)END(3) |
BEGIN(9)END(10) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(2)END(3) |
BEGIN(8)END(9) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(3)END(4) |
BEGIN(11)END(12) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(2)END(3) |
BEGIN(11)END(12) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(3)END(4) |
BEGIN(8)END(9) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(1)END(2) |
BEGIN(11)END(12) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(2)END(3) |
BEGIN(11)END(12) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(3)END(4) |
BEGIN(11)END(12) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(2)END(3) |
BEGIN(9)END(10) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(3)END(4) |
BEGIN(9)END(10) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(1)END(2) |
BEGIN(9)END(10) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(3)END(4) |
BEGIN(9)END(10) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(1)END(2) |
BEGIN(11)END(12) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(1)END(2) |
BEGIN(11)END(12) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(3)END(4) |
BEGIN(11)END(12) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(3)END(4) |
BEGIN(9)END(10) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(1)END(2) |
BEGIN(8)END(9) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(3)END(4) |
BEGIN(8)END(9) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(1)END(2) |
BEGIN(8)END(9) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(3)END(4) |
BEGIN(10)END(11) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(2)END(3) |
BEGIN(9)END(10) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(1)END(2) |
BEGIN(9)END(10) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(2)END(3) |
BEGIN(8)END(9) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(1)END(2) |
BEGIN(10)END(11) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(2)END(3) |
BEGIN(10)END(11) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(2)END(3) |
BEGIN(8)END(9) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(1)END(2) |
BEGIN(10)END(11) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(2)END(3) |
BEGIN(10)END(11) |
BEGIN(2015-07-30)END(2015-07-31) |
BEGIN(3)END(4) |
BEGIN(8)END(9) |
BEGIN(2015-08-01)END(2015-08-02) |
BEGIN(2)END(3) |
BEGIN(11)END(12) |
BEGIN(2015-07-31)END(2015-08-01) |
BEGIN(3)END(4) |
BEGIN(10)END(11) |
BEGIN(2015-07-30)END(2015-07-31) |
|
-
Restriction
Restriction is only affected by the existence of interval types, in that the set of operators is extended that can be invoked in its boolean expression. If a relation heading has an interval-typed attribute, then it is only obvious that it should become possible to invoke, e.g., the OVERLAPS() operator
using that attribute as one argument and using some other interval value as the second.
Restriction cannot produce a non-packed relation if its input relation is itself packed. Therefore, RESTRICT will not attempt any packing of itself, and so if the input of RESTRICT is unpacked, then so will be the result :
Restricting an unpacked relation with the degenerate condition "true" :
ARG |
RESTRICT |
II |
BEGIN(3)END(4) |
BEGIN(1)END(2) |
BEGIN(2)END(3) |
|
II |
BEGIN(3)END(4) |
BEGIN(1)END(2) |
BEGIN(2)END(3) |
|
-
Projection
The same obviously does not hold for projection. Assuming the world-records relation was the following value :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Mike Powell |
Long Jump |
1998-06-18 - |
Then a projection of this relation over attributes DISCIPLINE and DURING could yield :
DISCIPLINE |
DURING |
Long Jump |
1968-07-24 - 1998-06-18 |
Long Jump |
1998-06-18 - |
This relation is not in packed form. The projection operator implemented in SIRA_PRISE will therefore not produce this relation value as its result, but rather the following :
DISCIPLINE |
DURING |
Long Jump |
1968-07-24 - |
Note the subtle difference in meaning between the two results :
- the former : A person exists such that throughout DURING, that person held the world record in DISCIPLINE
- the latter : Throughout DURING, a person exists such that that person held the world record in DISCIPLINE [at least somewhere during that period]
If the former is what you really need, there is no way to achieve this projection result while leaving the interval attributes intact. You need to TRANSFORM the relation, decomposing the concerned interval-typed attributes into its constituent FROM and TO components, and project that relation.
-
Extension / Renaming
These operators are not affected by the possible appearance of interval values in the attributes of the relations they operate on, except then (in the case of extend) in the set of operators that can be invoked by the extend expressions. So when fed unpacked relations, expect them to produce unpacked
results :
Extending an unpacked relation :
ARG |
EXTENDSPEC |
EXTEND |
II |
BEGIN(3)END(4) |
BEGIN(1)END(2) |
BEGIN(2)END(3) |
|
boolean(true) |
EXTRA |
II |
True |
BEGIN(3)END(4) |
True |
BEGIN(1)END(2) |
True |
BEGIN(2)END(3) |
|
Note that EXTEND in particular can produce results that are packed even if the input is unpacked (namely if the extend expressions involve non-injective functions of the point intervals) :
Extend producing a packed relation from an unpacked one (sort of "by accident") :
ARG |
EXTENDSPEC |
EXTEND |
II |
BEGIN(3)END(4) |
BEGIN(1)END(2) |
BEGIN(2)END(3) |
|
extra(the_from(ii)) |
EXTRA |
II |
1 |
BEGIN(1)END(2) |
3 |
BEGIN(3)END(4) |
2 |
BEGIN(2)END(3) |
|
-
Union
The union operator can obviously cause an overlapping (or adjacency) of tuples in its output, if no special measures are taken. SIRA_PRISE's implementation of the operator does take these special measures :
Query result :
ARG2 |
ARG1 |
UNION |
|
|
|
-
Unionusing
If the default behaviour of UNION is not what you need, e.g. you explicitly want the "plain" union that keeps the "overlapping tuples", or you want a packing that is different from the default one (remember that the default packing is on all the interval-typed attributes), you need to use the UNIONUSING
operator, which allows you more control over the packing.
The UNIONUSING operator takes the same arguments as UNION, but has an additional "pack attributes list" argument. If you leave an interval-typed attribute of the relations out of the pack list, then this attribute will not be treated in "pointwise" mode by the UNION, and could possibly yield overlapping
values in the result :
UNION with not all interval-typed attributes in "pointwise" mode :
ARG1 |
ARG2 |
USINGSPEC |
UNIONUSING |
|
|
() |
II |
BEGIN(1)END(4) |
BEGIN(2)END(5) |
|
-
Intersection
Supposing that the world-record history for each discipline is kept in distinct relations, we might have the following relations (for, say, long-jump and 100m respectively) :
NAME |
DURING |
Bob Beamon |
1968-07-24 - 1998-06-18 |
NAME |
DURING |
Bob Beamon |
1968-02-24 - 1969-05-05 |
It is obvious that queries of the nature "Which athletes have been the world-record holder for both the long-jump and the 100m ?" should be answered using the intersection operator. But the intersection operator as defined before, would produce an empty relation, because there is no tuple that appears in
both relations ! Indeed, the DURING attribute in both tuples is unequal. This problem is dealt with by saying that when the intersection operator is applied to relations holding interval-typed attributes, it is sufficient that the interval-typed attribute values overlap (instead of having to be equal), and
that the value returned for such an attribute is the interval intersection of the attribute values in the "intersected tuples".
In the example, this would yield the relation :
NAME |
DURING |
Bob Beamon |
1968-07-24 - 1969-05-05 |
And that's indeed the period during which, according to the relations of the example, Bob Beamon held the world record in both disciplines.
(Allthough it has nothing to do with the RA per se, this example also illustrates an issue that is crucial enough in temporal database design to mention it explicitly here : whether this result is indeed the correct answer to the stated problem, is somewhat doubtful. The stated problem was "Which
athletes have been the world-record holder for both the long-jump and the 100m ?". Note carefully that there is no mention of a requirement to hold a world record in both disciplines _at the same moment in time_. Nor is there any mention of a relaxation that both world records must not have been held _at the
same moment in time_. This issue has more to do with how we express our queries in natual language (and how apt or inapt natural language happens to be for that purpose), but especially when dealing with temporal information, it is an important potential source of misunderstandings between users and IT
people.)
To revert to our actual subject, we continue with a question that might have arisen. If the intersection operator is "re-defined" to treat interval values by checking, not whether they are equal, but instead only whether they overlap, then what do I do if I really want the check to be for equality ? In
our example, how do I get the intersection operator to produce an empty relation value, precisely on account of the fact that the intervals are not equal ? It will not usually be the case that this behaviour is the needed or desired one, but it cannot in general be excluded either.
So obviously, the conclusion seems to be that it must be possible for the user to specify which of the two conditions ("overlap" vs. "equal") must be used for some particular invocation of the operator. Furthermore, this must be possible for each individual interval-typed attribute that the relations have.
Indeed, it might be the case that there is more than one interval-typed attribute in the relation, and that for some particular invocation of intersect, the "equality" behaviour is desired for one of those attributes, but the "overlaps" behaviour for the other. That problem is addressed by the INTERSECTUSING operator.
-
INTERSECTUSING
Analogous to UNIONUSING, the intersection operator also has an additional version that allows the user more control over the intersection and packing behaviour :
Intersection with interval-typed attributes not treated in "pointwise" modus :
ARG2 |
ARG1 |
USINGSPEC |
INTERSECTUSING |
|
|
() |
|
-
Difference
It will now be intuitively obvious how the relational difference operator deals with interval types. Supposing the world-record relations are as in the intersection example, but the question is now "Which athletes have been (and during which period have they been) world-record holder for the 100m, but
not for the long-jump ?".
The answer to this question is given by invoking the relational difference operator, that is defined to deal with interval values such as to return :
NAME |
DURING |
Bob Beamon |
1968-02-24 - 1968-07-24 |
So the relational difference operator deals with interval values by computing their interval difference, and including a tuple in the result only if that interval difference "still contains at least one point".
-
MINUSUSING
It will also be intuitively obvious how the relational difference operator also comes in a version that allows the user more control over the "difference" behaviour and over the packing :
Relational difference with interval-typed attributes not treated in "pointwise" modus :
MINUEND |
SUBTRAHEND |
USINGSPEC |
MINUSUSING |
|
|
() |
|
-
Join
If in an invocation of the join operator, the set of join attributes contains an interval-typed attribute, then join deals with that attribute in exactly the same way as defined for intersection : it is not necessary for the interval values to be equal for a tuple to be included in the result of the
join, instead it is sufficient that they overlap (i.e. have at least one value of their "underlying type" in common). The value returned for that attribute in the result of the join is the interval intersection of both interval values.
-
JOINUSING
It will also be intuitively obvious how the join operator also comes in a version that allows the user more control over the "difference" behaviour and over the packing :
JOIN with some interval-typed attributes not treated in "pointwise" modus :
ARG1 |
ARG2 |
USINGSPEC |
JOINUSING |
|
|
() |
|
-
Division
Division is a problematic operator to define a "temporal" counterpart for. The most important reason for this state of affairs being, I believe, that there is no single, unique sensible way to incorporate the "temporal" aspect into the formulation of the kind of queries that relational division was aimed
to solve.
Recall from 4.4.9 that division was aimed to solve the kind of query like "get the list of customers who already have all possible types of insurance policy", or "employees who have followed all training programs they were eligible for", or some such.
There is no single and universally applicable answer to the question "Must those customers have had all possible types of policy at the same point in time, or is it sufficient that they have had at least one policy of all possible types at just any point in time -not necessarily the same point in time-
?". Nor is it obvious whether "all possible types of policy" refers to "all types of policy that were possible at one particular moment in time" or to "all types of policy that have been possible throughout all times", or still some other different possible interpretation.
For this reason (and possibly still others), a temporal counterpart to relational division is simply not defined. The relational division operator, if, when and where supported, therefore operates on relations holding interval-typed attributes in exactly the same way as it does on relations not holding
such attributes. It is, however, very unlikely that the results produced by relational division when interval-typed attributes are in play, can make very much sense.
-
Semijoin
In section 4.4.10, the semijoin operator was defined to be equivalent to the projection of a join of the two arguments. The semijoin of relations holding interval-typed attributes is defined in the same way. But note that the result of the operator is influenced by how the "join" part of the operator
deals with interval-typed values. Note that despite the fact that the "outer" operator of the equivalent expression is projection, which potentially gives rise to non-packed relations, this phenomenon cannot occur with semijoin.
-
Using Semijoin
Just like the "basic" join operator, semijoin comes with an additional "using" variant that gives the user more control over which interval-typed attributes are treated in "pointwise" modus, and which as "regular scalar attributes". The definition is analogous to that of "default" semijoin : It is the
projection onto the attributes of the first argument, of the using-join of the two arguments for the specified pack attributes list.
-
Semidifference
Like semijoin, the definition of semidifference does not need to be altered for dealing with interval-typed attributes.
-
Using Semidifference
Just like the "basic" join operator, semiminus comes with an additional "using" variant that gives the user more control over which interval-typed attributes are treated in "pointwise" modus, and which as "regular scalar attributes". The definition is analogous to that of "default" semijoin : Its details
should be obvious and are not specified here.
-
Symmetric difference
Earlier on, the symmetric difference operator was defined to be a shorthand for some specific combination of relational unions, differences and intersections. This means that symmetric difference "inherits" the way interval-typed attributes are treated, from the UNION, MINUS and INTERSECT operators in
terms of which it is defined.
If this is not what the user needs, the XMINUSUSING operator is offered as a solution, which is the obvious (i.e. analogous to XMINUS) shorthand for a similar combination of UNIONSUSINGs, MINUSUSINGs and INTERSECTUSINGs.
-
Using Symmetric difference
The XMINUSUSING operator is the obvious (i.e. analogous to XMINUS) shorthand for a similar combination of UNIONSUSINGs, MINUSUSINGs and INTERSECTUSINGs.
-
left join
Earlier on, the left join operator was defined to be a shorthand for some specific combination of relational join, union, extend and semiminus. This means that left join "inherits" the way interval-typed attributes are treated, from the UNION, JOIN, SEMIMINUS and EXTEND operators in terms of which it is
defined. Note in particular that as a consequence of this definition, there are possibly two distinct sets of pack attributes in play :
- The one that applies to the JOIN and SEMIMINUS parts, which are only the interval-typed attributes that are common to the two relations, and
- The one that applies to the UNION part, which are all the interval-typed attributes of the left join's first argument
There is a difference between these two if the left join's first argument has any interval-typed attribute that the second argument does not have. In intricate and/or exotic cases, this may cause results that may look rather weird at first sight.
-
LEFTJOINUSING
The LEFTJOINUSING operator is the obvious (i.e. analogous to LEFTJOIN) shorthand for a similar combination of UNIONSUSINGs, EXTENDs, SEMIMINUSUSINGs and JOINUSINGs.
Note however that LEFTJOINUSING will take _only a single_ list of pack attributes and will apply this single list of pack attributes to both the UNIONSUSING part and the SEMIMINUSUSING/JOINUSING part. There is NO way to take control over the "USING" that applies to the UNION part, separately from taking
control over the "USING" that applies to the SEMIMINUSUSING/JOINUSING parts.
-
Transitive closure
Defining a "temporal counterpart" for transitive closure is somewhat subject to the very same problem as defining a "temporal counterpart" for relational division.
So despite the fact that "parts explosions" and similar problems regularly show up in the IT world, and despite the fact that "attaching a temporal connotation" to this kind of information is quite conceivable, a "temporal counterpart" of the operator that is aimed to address this category of problems is
currently not defined.
-
Group/Ungroup
The group operator is affected in two distinct ways by the possible occurrence of interval-typed attributes, the first of which is the case when there are multiple group specifications. Consider the following hypothetical relation value :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Bob Beamon |
100m |
1968-02-24 - 1969-05-05 |
If this relation were grouped (without "special measures" to cater for the interval-typed attributes) per NAME, with the two grouping specificiation RVA_DISCIPLINE(DISCIPLINE) and RVA_DURING(DURING), then the relation-valued attribute RVA_DURING would contain a relation with two tuples, each holding a
DURING value that overlaps.
The second way in which GROUP is affected by the presence of interval-typed attributes, is best illustrated by considering the foregoing hypothetical relation, and how it must be processed in order to obtain the answer to the query "For each distinct set of disciplines that Bob Beamon has been world
record holder in, show the period during which he held the world record in all of the disciplines in that set". The result would have to be something like this :
NAME |
DISCIPLINES |
DURING |
Bob Beamon |
|
1968-02-24 - 1968-07-24 |
Bob Beamon |
DISCIPLINE |
100m |
Long Jump |
|
1968-07-24 - 1969-05-05 |
Bob Beamon |
|
1969-05-05 - 1998-06-18 |
Obtaining such results is the subject of TDATRM's USING-GROUP operator. Note that SIRA_PRISE currently does not support such an operator (it's on the todo list), and obtaining these resutls is possible only if :
- The interval type is based on an ordinal type (meaning all of NEXT,FIRST,LAST and PRIOR) are available for the underlying point type, and
- UNPACK is used explicitly to obtain a relation with single-point interval values that can be grouped over on a per-day basis, after which
- PACK is used to re-group the tuples that have the same DISCIPLINES RVA value for adjacent days.
Ungroup is also affected in the multi-element ungroup version. Consider following relation :
NAME |
RVA_DISCIPLINE |
RVA_DURING |
Bob Beamon |
rva_dis1 |
rva_dur1 |
Bob Beamon |
rva_dis2 |
rva_dur2 |
where the respective relation values of the RVA_ attributes are as follows :
rva_dis1
DISCIPLINE |
Long Jump |
100m |
rva_dis2
rva_dur1
DURING |
1968-07-24 - 1998-06-18 |
rva_dur2
DURING |
1968-02-24 - 1969-05-05 |
Note that this relation cannot possibly be produced as the result of an invocation of the group operator. Nevertheless, under some interpretations, it is possible to define a result for a multi-element ungroup invocation on this relation.
Under such an interpretation, overlapping tuples could potentially be produced for the longjump discipline :
NAME |
DISCIPLINE |
DURING |
Bob Beamon |
Long Jump |
1968-07-24 - 1998-06-18 |
Bob Beamon |
Long Jump |
1968-02-24 - 1969-05-05 |
Bob Beamon |
100m |
1968-02-24 - 1969-05-05 |
To repeat, such a multi-element version of ungroup is not commonly agreed upon.
-
Aggregate / Summarizeby
The complications that arise when interval types and data aggregations come together are so great that it is beyond the scope of this introductory text to treat the subject. (For one, note that SUMMARIZEBY has connotations similar to GROUP, so if interval-typed attributes are part of the "BY" key, it can
be expected that "partitioning" according to the overlapping of the interval values in those attributes will be required, just as it is for USING-GROUP discussed in the previous section.)
-
A remark on associativity
The relational operators UNIONUSING, INTERSECTUSING, XMINUSUSING and JOINUSING have been discussed here in their most common form, taking two relational arguments plus a pack specification. However, these four operators also "inherit" the property of associativity from their counterparts
UNION,JOIN,INTERSECT and XMINUS. It is therefore desirable to also support invocations of the "USING" versions of these operators that take 3,4,..., any number of relations as an argument. SIRA_PRISE supports this, but there is a convention to be followed regarding where in the argument list the pack attributes
list is to be specified. That convention is as follows : the USING versions of these four operators will always try to interpret the _LAST_ argument as the pack attributes list.
As is the case with the "standard" verions of these operators, the theoretical possibility of invocations of these operators with only one argument, or with no arguments at all, is syntactically not supported by SIRA_PRISE. Unlike the "standard" versions of these operators, it is also impossible to use
these four associative "USING" operators as an aggregation operator within an AGGREGATE invocation, since no syntactic device is available to specify the using list, inside the context of an aggregation spec.
|