Rao, J., Pirahesh, H., & Zuzarte, C. (2004). Canonical abstraction for outerjoin optimization. SIGMOD, 671. https://doi.org/10.1145/1007568.1007643
This paper presents the CBA (compensation-based approach) to solve the join reordering problem. It introduces two compensation operators, through which outer join queries could be converted into a canonical representation, just like inner join queries.
The authors are inspired by the fact that an inner join query can be represented by:
- A sequence of Cartesian products of all relations; and
- A sequence of selection operations.
Then, the paper purposes a similar representation for outer join queries:
- A sequence of outer Cartesian products of all relations; and
- A sequence of nullification operations; and
- A best-match operator.
The deep reason why the authors try to pursue such a canonical abstraction is: with this nice form, we can achieve full commutativity and transitivity among all relations in the query. In this paper, a outer join query is defined to be one query with any number of left outerjoins and inner joins. Such a query can be represented by a binary tree, in which the leaf nodes are base relations and internal nodes are join predicates.
Then, we define 3 operators used in the canonical abstraction:
- Nullification operator: for tuples in which the predicate
p
does not evaluate totrue
, attributes inA
will be set tonull
; - Best-match operator: removes all spurious tuples (either duplicates, or dominated by other tuples);
- Outer Cartesian product: similar to normal Cartesian product, with the difference that when R is empty, it returns all tuples in S and pads
null
in all attributes in R.
Then, we summarize that an outer join predicate nullifies the null-producing relation, while an inner join predicate nullifies both relations. The paper then points out that nullification operators themselves are not commutative in general. Two nullification operators are commutative if they nullify the same set of attributes. This is because, when the outer nullification is interchanged to be the inner one, the null
s generated by the original inner nullification (current outer nullification) will no longer be seen by the original outer nullification (current inner nullification). Let’s revisit the definition of nullification operator: it says the predicate p
does not evaluate to true
. In other words, it will nullify the attributes when p
is either false
or null
. Therefore, if the null
s generated are not seen, the result will be different.
Now, the last step before achieving a canonical abstraction for outer join queries is that nullification operators are not interchangeable. Under the condition that all predicates are null-intolerant, the paper tries to make nullification operators interchangable. The addition of these implied nullification operations will make nullification operators interchangable. The reason is that these implied nullification will ensure that all null
s will be generated regardless of the order of nullification operators. An additional benefit of applying these implied nullification is that transitivity will be applied across two outer joins.
Then, the paper introduces the concept of nullification sets. Given a relation R
, its nullification set NSr
is a set of all predicates that can nullify R
if not true
. Then an algorithm is presented to populate all nullification sets by iterating the query tree from bottom-up in the post-order.
- For an outer join node, the join predicate
p
nullifies each relation in the null producing side. For each relationt
that is from the left side (preserving side) and is referenced byp
, all predicates in its nullification setNSt
will nullifyp
and thus nullify each relation in the right side (null producing side) as well. - For an inner join node, the join predicate
p
nullifies each relation from both sides. Thus, for relations referenced byp
, their nullification sets should be merged as well. - After the bottom-up post-order traversal, iterate through each nullification set to generate additional predicates by transitivity.
It is guaranteed that all nullification sets in the operator tree are complete upon one bottom-up post-order traversal. The nullification sets generated have 2 properties:
- Applying the nullification operation in any order is equivalent to the original order. In other words, the nullification operators are interchangable;
- Two equivalent outer join queries have the same nullification sets.
Since the nullification operators are interchangable now, we have fixed the last gap before achieving the canonical abstraction for an outer join query. Now, we can enjoy the benefits of better join reorderability for outer join queries. With the nullification sets introduced, we have an additional benefit that predicates within the same ON
clause can be split and do not have to be applied together.
Then, the authors try to organize these nullification sets into a directed graph. Each node represent a unique nullification set (if the nullification sets of two relations are the same, they belong to the same node). There is an edge from node a
to node b
if NSa
is a subset of NSb
. This graph is referred to as DAGns
. The DAG has the following properties:
- This DAG has only 1 root and all nodes can be reached via a path from this root node.
- For a node
n
in this DAG, predicates inNSn
can only refer to relations assigned to nodes in the path from root ton
. - Relations assigned to the same node
n
in this DAG must have been inner joined together in the original operator tree, using predicates inNSn
. - For each pair of parent & child nodes, they should be outer joined together with the child as the null producing side, using the predicates in
NSchild - NSparent
.
Following up, the paper tries to purpose a DP-style bottom-up plan enumeration algorithm. To distinguish among plans corresponding to the same logical expression, plan properties are used and stored along the way. To extend the optimizer for outer join support, we need to add a plan property to store the nullification set after each join node. During the enumeration process, the generation of nullification sets is tracked and any missing nullification operation will be added in the end.
To implement a best-match operator, the paper purposes to do a sorting to obtain a favorable ordering (in which null
s are sorted last). Then, we can filter out all spurious tuples via a one-pass iteration. To get access to the previous tuple, the authors purpose to use the OVER ... WINDOW
clause in SQL:1999 standard. This can also help us deal with the special case where the first tuple does not have a previous tuple.
Questions
- Why beta(R) = R when R is a base relation?
- For
ON
clause with multiple predicates, it seems that the paper only discusses the scenario where they are connected usingAND
. But what if they are connected usingOR
? - Why did the authors set the buffer size to be a little over 1GB to make sure all data can reside in memory?