join is a kind of SQL operation in relational algebra, combining columns cross tables.
Semantics of Join
The ANSI-standard SQL specifies five different types of
FULL OUTER and
- Cross join returns the Cartesian product of matched rows from two tables.
select * from TA cross join TB [where pred]
- Inner join requires each row in the two joined tables to have matching column values. The query compares each row of A and B to find all pairs of rows that satisfy the join predicate.
select * from TA inner join TB on matchpred
Implementation: the result of inner join can be seen as filter the result set of cross join (Cartesian product). Actually SQL implementations normally use other more efficient approaches:
- hash join: apply a hash function to the join attribute.
- sort-merge join: first sort the rows by the join attribute.
- Outer join retains each matched rows from table, event if no corresponding matching row exists. Outer join subdivide further into left outer joins, right outer joins and full outer joins, depending on which table’s rows to retain.
- The corresponding columns are
NULLif no matching row exists for a row in result set.
select * from TA (left/right/full) outer join TB on matchpred
- In SQLite, only left outer join is supported currently.
Optimize Join in OLAP
Join as Map
Under certain circumstances, for example in Spark, the SQL operation are executed in a distributed manner.
join operation will trigger shuffle. When there exists some level of data skew, where a lot of
data records shared the same key, the shuffle may route those data to a single worker, leading to out of
memory error, or long-tail problem.
When one of the operands of
join is fairly small than another one, the
join operation could be transformed
map operation. First, create a broadcast variable to transfer the smaller RDD to every nodes, then
use a custom
map operation (
filter may also be involved) to gain the same outcome of the original