Wang, T., & Chan, C.-Y. (2018). Improving Join Reorderability with Compensation Operators. SIGMOD, 693–708. https://doi.org/10.1145/3183713.3183731
This paper presents a new approach, named as enhanced compensation approach (ECA). This new approach addes 2 more compensation operators to better support semijoins and antijoins. It provides better join reorderability compared to the 2 state-of-the-art approaches, TBA and CBA. This paper makes 4 contributions:
- Introduce a precise formalization of complete join reorderability;
- Develop the new ECA approach ECA for the join reordering problem;
- Achieve complete join reorderability for single-sided outer join, semijoins, antijoins;
- Achieve partial join reorderability for full (double-sided) outer join (but still better than TBA and CBA).
- Design a top-down plan enumeration algorithm to compute the optimal query plans;
- Perform an experiement using ECA’s evaluation.
This paper then gives a summary of the key ideas behind TBA and CBA. TBA uses 3 properties (associativity, left asscom & right asscom) to derive all valid transformations. While, CBA tries to re-write all queries into a canonical representation using the nullification operator and the best-match operator. CBA finds out that the predicate of an inner join nullifies both sides, while a single-sided outer join nullifies the null-producing side only. CBA can enable more join reorderability because the outer Cartesian products are both commutative and associative. The optimizer shall always reorder this part first to the desired ordering.
Then, the authors give a formal definition of join ordering and join reorderability. A join ordering can be represented by an un-ordered binary tree, in which the internal nodes are the join predicates and the leaf nodes are the individual relations. The predicates stored at each internal node must satisfy 2 conditions:
- The relations referenced by this predicate must all be within the subtrees rooted at this internal node;
- The predicate must reference some relation(s) from each subtree rooted at this internal node.
Personally, I think the 2nd condition does not alway hold. For example, it does not hold for Cartesian products. However, it is a conventional heuristics that the query optimizer tries to avoid “bad plans”. Thus, cross products will not be considered by most database systems. Therefore, this small glitch does not affect the discussion in this paper.
Notice: only the join predicates (rather than the join operators) are stored in the internal nodes. Thus, a join ordering only records the order of the join operands. Given a specific join ordering T
, the actual join operators (i.e., inner join, outer join, semijoin, antijoin) used are not specified.
Further on, let’s say we are a given a set of un-ordered binary trees, which contain all join orderings of a given query Q
. Q
is reorderable with respect to a given join ordering T
and a set of compensation operators P
if
Q
can be rewritten into an equivalent queryQ'
;- Join operands in
Q'
follow the order given byT
; - The compensation operators in
P
could possibly be used.
In addition, let’s we are given a query class C
and a set of compensation operators P
. C
is completely reorderable with respective to P
if: for every query Q
in C
and any of its ordering T
, Q
is reorderable with respect to T
and P
.
In a nutshell, TBA uses an empty set of compensation operators, thus brings very limited reorderability. CBA uses 2 compensation operators, thus brings much better reorderability for inner joins and single-sided outer joins. Since semijoin can be rewritten as an inner join followed by a projection, CBA helps to improve reorderability for that as well. In this paper, ECA adds 2 more compensation operators to improve reorderability for antijoins as well.
In ECA, the following 4 compensation operators are used:
- Nullification operator: for tuples in which the predicate
p
does not evaluate totrue
, attributes inA
will be set tonull
; - Best-match operator: removes all spurious tuples (either duplicates, or dominated by other tuples);
- Lambda operator: removes all tuples where attributes in
A
are not null; - Lambda star operator: nullifies all attributes (except those in
B
) in tuples where attributes inA
are not null.
Then, the paper purposes the properties of the 2 new compensation operators and the rewriting rule for antijoin. It converts an antijoin to be a single-sided outer join, followed by a pruning using lambda operator, with a projection in the end. Then, the join orderings enabled by this new rewriting rule are given.
For free reordering, one may need to pull up the nullification operators and lambda operators (potentially changing it to the lambda star operators). This is to avoid the nullification operator and the lambda operator being sandwiched between two join operators.
Then, the paper presents a top-down plan enumeration algorithm. A top-down plan enumerator basically divides the given query into two disjoint non-empty sets, and then recursively apply the algorithm on two sets. To improve performance, the best query subplan for a given set can be cached. You may want to use a bitmap to represent a set. However, the challenge here is: by introducing the compensation operators, the optimal query plan could be different even for the same set of relations. To solve this, each query subplan SP
is enumerated within a certain context, which is a complete query plan P
.
To divide a query into 2 disjoint sets, we introduce the concept of joinable pairs. Given a set of relations S
in a query, two subsets S1
and S2
become a joinable pair if:
S = S1 + S2
;- there exists a join ordering
T
in which there is an internal node such thatS1
andS2
are the immediate subtrees of that node.
Each joinable pair has an additional property that S1
and S2
become a joinable pair with join node j
if j
is the only node in query P
that refers to some relation in each of S1
and S2
.
Before recursively calling GenerateSubplan
, we need to swap the join node to become the immediate child of i
. Also, to embed a certain query subplan in a given context, SP
is represented by subtree(P, S)
, where P
being a complete query plan is the context. Also, subtree(P, S)
is the smallest subtree that satisfies the following 2 properties:
SP
contains all relations inS
; andSP
contains all the compensation operators that are between the root join node and the nearest ancestor join node.
The 2nd property is important to differentiate the subplans for the same set of relations but with different compensation operators. It actually also implies that we may need to pull up the nullification operators and lambda operators.
The paper then purposes a possible refinement to enable the use of query subplans. A join operator depends on compensation operators when:
- If a compensation operator is generated when a join operator is swapped above another join operator, then both join operators depend on it; or
- If a compensation operator is swapped above a join operator and causes the join type to be changed, then the join operator depends on it.
To capture such dependencies, a new labeled edge type is introduced into the query graph. The paper named it as dependency edge (d-edge). The d-edges are equivalent if both the source node (i.e., join operator) and label are the same. We further define d-edges whose source nodes are within the subplan but destination node are not as external d-edges. Two subplans are considered to be equivalent if their external d-edges are equivalent.
The authors also mentioned the reason why a bottom-up plan enumeration algorithm is hard. Due to the existence of compensation operators, we have to keep track of the subplan equivalence.
Regarding the implementation of compensation operators, there could be 2 approaches: a SQL-based approach and a system-level native approach. This paper shows that it is simple to implement all operators using SQL approach. Regarding cost model, the time complexity for nullification operator and lambda operator is O(n)
, while the time complexity for best-match operator and lambda star operator is O(n logn)
.
Questions
- In the swap function, how is the
assoc
,l-asscom
andr-asscom
determined? - How does top-down approach help to keep track of the equivalence of subplans in the presence of compensation operators?
- Using the SQL approach to implement the compensation operators, how can we enforce the engine to query using the specified ordering?
- Can we use only the 2 original operators since the 2 new operators can be represented using the 2 old operators?