join is a kind of SQL operation in relational algebra, combining columns cross tables.

Semantics of Join

The ANSI-standard SQL specifies five different types of join operations: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.

  • Cross Join

    • Cross join returns the Cartesian product of matched rows from two tables.
    • Syntax: select * from TA cross join TB [where pred]
    • Explanation:

      \[[ ra .. rb | ra \leftarrow TA, rb \leftarrow TB | pred(ra, rb) ]\]
  • Inner Join

    • Inner join requires each row in the two joined tables to have matching column values. The query compares each row of A and B to find all pairs of rows that satisfy the join predicate.
    • Syntax: select * from TA inner join TB on matchpred
    • Explanation:

      \[[ ra .. rb | ra \leftarrow TA, rb \leftarrow TB | matchpred(ra, rb) ]\]
    • Implementation: the result of inner join can be seen as filter the result set of cross join (Cartesian product). Actually SQL implementations normally use other more efficient approaches:

      • hash join: apply a hash function to the join attribute.
      • sort-merge join: first sort the rows by the join attribute.
  • Outer Join

    • Outer join retains each matched rows from table, event if no corresponding matching row exists. Outer join subdivide further into left outer joins, right outer joins and full outer joins, depending on which table’s rows to retain.
    • The corresponding columns are NULL if no matching row exists for a row in result set.
    • Syntax: select * from TA (left/right/full) outer join TB on matchpred
    • Explanation:

      \[[ (ra .. rb) \texttt{or} (ra .. NULL) | ra \leftarrow TA, rb \leftarrow TB | matchpred(ra, rb) ]\]
    • In SQLite, only left outer join is supported currently.

Optimize Join in OLAP

Join as Map

Under certain circumstances, for example in Spark, the SQL operation are executed in a distributed manner. The join operation will trigger shuffle. When there exists some level of data skew, where a lot of data records shared the same key, the shuffle may route those data to a single worker, leading to out of memory error, or long-tail problem.

When one of the operands of join is fairly small than another one, the join operation could be transformed as a map operation. First, create a broadcast variable to transfer the smaller RDD to every nodes, then use a custom map operation (filter may also be involved) to gain the same outcome of the original join request.