Wang, T., & Chan, C.-Y. (2018). Improving Join Reorderability with Compensation Operators. SIGMOD, 693–708. https://doi.org/10.1145/3183713.3183731

This paper presents a new approach, named as enhanced compensation approach (ECA). This new approach addes 2 more compensation operators to better support semijoins and antijoins. It provides better join reorderability compared to the 2 state-of-the-art approaches, TBA and CBA. This paper makes 4 contributions:

This paper then gives a summary of the key ideas behind TBA and CBA. TBA uses 3 properties (associativity, left asscom & right asscom) to derive all valid transformations. While, CBA tries to re-write all queries into a canonical representation using the nullification operator and the best-match operator. CBA finds out that the predicate of an inner join nullifies both sides, while a single-sided outer join nullifies the null-producing side only. CBA can enable more join reorderability because the outer Cartesian products are both commutative and associative. The optimizer shall always reorder this part first to the desired ordering.

Then, the authors give a formal definition of join ordering and join reorderability. A join ordering can be represented by an un-ordered binary tree, in which the internal nodes are the join predicates and the leaf nodes are the individual relations. The predicates stored at each internal node must satisfy 2 conditions:

Personally, I think the 2nd condition does not alway hold. For example, it does not hold for Cartesian products. However, it is a conventional heuristics that the query optimizer tries to avoid “bad plans”. Thus, cross products will not be considered by most database systems. Therefore, this small glitch does not affect the discussion in this paper.

Notice: only the join predicates (rather than the join operators) are stored in the internal nodes. Thus, a join ordering only records the order of the join operands. Given a specific join ordering T, the actual join operators (i.e., inner join, outer join, semijoin, antijoin) used are not specified.

Further on, let’s say we are a given a set of un-ordered binary trees, which contain all join orderings of a given query Q. Q is reorderable with respect to a given join ordering T and a set of compensation operators P if

In addition, let’s we are given a query class C and a set of compensation operators P. C is completely reorderable with respective to P if: for every query Q in C and any of its ordering T, Q is reorderable with respect to T and P.

In a nutshell, TBA uses an empty set of compensation operators, thus brings very limited reorderability. CBA uses 2 compensation operators, thus brings much better reorderability for inner joins and single-sided outer joins. Since semijoin can be rewritten as an inner join followed by a projection, CBA helps to improve reorderability for that as well. In this paper, ECA adds 2 more compensation operators to improve reorderability for antijoins as well.

In ECA, the following 4 compensation operators are used:

Then, the paper purposes the properties of the 2 new compensation operators and the rewriting rule for antijoin. It converts an antijoin to be a single-sided outer join, followed by a pruning using lambda operator, with a projection in the end. Then, the join orderings enabled by this new rewriting rule are given.

For free reordering, one may need to pull up the nullification operators and lambda operators (potentially changing it to the lambda star operators). This is to avoid the nullification operator and the lambda operator being sandwiched between two join operators.

Then, the paper presents a top-down plan enumeration algorithm. A top-down plan enumerator basically divides the given query into two disjoint non-empty sets, and then recursively apply the algorithm on two sets. To improve performance, the best query subplan for a given set can be cached. You may want to use a bitmap to represent a set. However, the challenge here is: by introducing the compensation operators, the optimal query plan could be different even for the same set of relations. To solve this, each query subplan SP is enumerated within a certain context, which is a complete query plan P.

To divide a query into 2 disjoint sets, we introduce the concept of joinable pairs. Given a set of relations S in a query, two subsets S1 and S2 become a joinable pair if:

Each joinable pair has an additional property that S1 and S2 become a joinable pair with join node j if j is the only node in query P that refers to some relation in each of S1 and S2.

Before recursively calling GenerateSubplan, we need to swap the join node to become the immediate child of i. Also, to embed a certain query subplan in a given context, SP is represented by subtree(P, S), where P being a complete query plan is the context. Also, subtree(P, S) is the smallest subtree that satisfies the following 2 properties:

The 2nd property is important to differentiate the subplans for the same set of relations but with different compensation operators. It actually also implies that we may need to pull up the nullification operators and lambda operators.

The paper then purposes a possible refinement to enable the use of query subplans. A join operator depends on compensation operators when:

To capture such dependencies, a new labeled edge type is introduced into the query graph. The paper named it as dependency edge (d-edge). The d-edges are equivalent if both the source node (i.e., join operator) and label are the same. We further define d-edges whose source nodes are within the subplan but destination node are not as external d-edges. Two subplans are considered to be equivalent if their external d-edges are equivalent.

The authors also mentioned the reason why a bottom-up plan enumeration algorithm is hard. Due to the existence of compensation operators, we have to keep track of the subplan equivalence.

Regarding the implementation of compensation operators, there could be 2 approaches: a SQL-based approach and a system-level native approach. This paper shows that it is simple to implement all operators using SQL approach. Regarding cost model, the time complexity for nullification operator and lambda operator is O(n), while the time complexity for best-match operator and lambda star operator is O(n logn).

Questions