Join in SQL
join
is a kind of SQL operation in relational algebra, combining columns cross tables.
Semantics of Join
The ANSI-standard SQL specifies five different types of join
operations: INNER
, LEFT OUTER
,
RIGHT OUTER
, FULL OUTER
and CROSS
.
-
Cross Join
- Cross join returns the Cartesian product of matched rows from two tables.
- Syntax:
select * from TA cross join TB [where pred]
-
Explanation:
\[[ ra .. rb | ra \leftarrow TA, rb \leftarrow TB | pred(ra, rb) ]\]
-
Inner Join
- Inner join requires each row in the two joined tables to have matching column values. The query compares each row of A and B to find all pairs of rows that satisfy the join predicate.
- Syntax:
select * from TA inner join TB on matchpred
-
Explanation:
\[[ ra .. rb | ra \leftarrow TA, rb \leftarrow TB | matchpred(ra, rb) ]\] -
Implementation: the result of inner join can be seen as filter the result set of cross join (Cartesian product). Actually SQL implementations normally use other more efficient approaches:
- hash join: apply a hash function to the join attribute.
- sort-merge join: first sort the rows by the join attribute.
-
Outer Join
- Outer join retains each matched rows from table, event if no corresponding matching row exists. Outer join subdivide further into left outer joins, right outer joins and full outer joins, depending on which table’s rows to retain.
- The corresponding columns are
NULL
if no matching row exists for a row in result set. - Syntax:
select * from TA (left/right/full) outer join TB on matchpred
-
Explanation:
\[[ (ra .. rb) \texttt{or} (ra .. NULL) | ra \leftarrow TA, rb \leftarrow TB | matchpred(ra, rb) ]\] - In SQLite, only left outer join is supported currently.
Optimize Join in OLAP
Join as Map
Under certain circumstances, for example in Spark, the SQL operation are executed in a distributed manner.
The join
operation will trigger shuffle. When there exists some level of data skew, where a lot of
data records shared the same key, the shuffle may route those data to a single worker, leading to out of
memory error, or long-tail problem.
When one of the operands of join
is fairly small than another one, the join
operation could be transformed
as a map
operation. First, create a broadcast variable to transfer the smaller RDD to every nodes, then
use a custom map
operation (filter
may also be involved) to gain the same outcome of the original join
request.