Inner join、Outer join、Full join中on与where的区别

Mar 20, 2021, 9:58 am

背景

本文主要讨论hive中,不同join方式下on条件和where条件的区别,同时关注hive中如何执行语句。比如谓词下推,就是其中一种优化技术。

原表

person表

person.id       person.name     person.age
3 mili 21
4 tom 19
5 mike 18
8 nul 20

account表

account.id      account.account
1 2222
2 2323
3 1111
4 1212
5 5555
6 6666

Inner join

通常情况下简写为join
inner join,查询条件在on中和where中,执行结果上并没有不同。
比如,下面几个查询语句执行结果相同

1
2
3
4
5
6
select * from person join account on person.id=account.id and person.id=4;
select * from person join account on person.id=account.id and account.id=4;
select * from person join account on person.id=account.id where person.id=4;
select * from person join account on person.id=account.id where account.id=4;
select * from person join account on person.id=account.id and person.id=4 and account.id=4;
select * from person join account on person.id=account.id where person.id=4 and account.id=4;

执行结果

person.id       person.name     person.age      account.id      account.account
4 tom 19 4 1212

通过explain查看上述执行语句转换的mapreduce任务,所有的任务都被转化为相同的执行计划:

hive (test)> explain select * from person join account on person.id=account.id and person.id=4 and account.id=4;
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
person
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
person
TableScan
alias: person
Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 4) (type: boolean)
Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 4 (type: int)
1 4 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: account
Statistics: Num rows: 5 Data size: 42 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 4) (type: boolean)
Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 4 (type: int)
1 4 (type: int)
outputColumnNames: _col1, _col2, _col7
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 4 (type: int), _col1 (type: string), _col2 (type: int), 4 (type: int), _col7 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.253 seconds, Fetched: 62 row(s)

不管是account表还是person表,TableScan中均包含“predicate: (id = 4)”这条查询条件,这就意味着读取2个表的数据是先根据条件筛选数据,然后再做join操作。
总结:inner join中筛选条件在on或where中,没有差异,因为hive在底层做了逻辑方面的优化,不同的语句被转化为相同的执行任务。

Outer join

以left outer join为例, 通常简写为left join. on条件不会影响左表返回的结果,仅影响右表,where条件会影响左表返回结果。

1
2
# sql 1.1
select * from person left join account on person.id=account.id and person.id=4;

执行结果:

person.id       person.name     person.age      account.id      account.account
3 mili 21 NULL NULL
4 tom 19 4 1212
5 mike 18 NULL NULL
8 nul 20 NULL NULL

执行计划:
无论左表还是右表,均全量读取。
1
2
# sql 1.2
select * from person left join account on person.id=account.id and account.id=4;

执行结果:

同sql 1.1

执行计划:

hive (test)> explain select * from person left join account on person.id=account.id and account.id=4;
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
account
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
account
TableScan
alias: account
Statistics: Num rows: 5 Data size: 42 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 4) (type: boolean)
Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 id (type: int)
1 id (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: person
Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col2, _col6, _col7
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.349 seconds, Fetched: 59 row(s)

左表全量读取,右表仅读取account.id=4的数据。
1
2
# sql 1.3
select * from person left join account on person.id=account.id and person.id=4 and account.id=4;

执行结果:

同sql 1.2

执行计划:
同sql 1.2。
等价于person.id加不加都不影响结果和执行过程。
1
2
# sql 1.4
select * from person left join account on person.id=account.id where person.id=4;

执行结果:

person.id       person.name     person.age      account.id      account.account
4 tom 19 4 1212

执行计划:

hive (test)> explain select * from person left join account on person.id=account.id where person.id=4;
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
account
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
account
TableScan
alias: account
Statistics: Num rows: 5 Data size: 42 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 4) (type: boolean)
Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 id (type: int)
1 id (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: person
Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 4) (type: boolean)
Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col1, _col2, _col6, _col7
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 4 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.228 seconds, Fetched: 62 row(s)

执行计划表明,无论是左表还是右表,均在读取数据时做了筛选,然后执行join。
1
2
# sql 1.5
select * from person left join account on person.id=account.id where account.id=4;

执行结果:

同sql 1.4

执行计划:
无论左表还是右表,均全量读取。
1
2
# sql 1.6
select * from person left join account on person.id=account.id where person.id=4 and account.id=4;

执行结果:

同sql 1.4

执行计划:
同sql1.4。

Full join

1
2
# sql 2.1
select * from person full join account on person.id=account.id;

执行结果:

person.id       person.name     person.age      account.id      account.account
NULL NULL NULL 1 2222
NULL NULL NULL 2 2323
3 mili 21 3 1111
4 tom 19 4 1212
5 mike 18 5 5555
NULL NULL NULL 6 6666
8 nul 20 NULL NULL

执行计划:
左表与右表均全量读取。
1
2
# sql 2.2
select * from person full join account on person.id=account.id and person.id=4;

执行结果:

person.id       person.name     person.age      account.id      account.account
NULL NULL NULL 1 2222
NULL NULL NULL 2 2323
3 mili 21 NULL NULL
NULL NULL NULL 3 1111
4 tom 19 4 1212
5 mike 18 NULL NULL
NULL NULL NULL 5 5555
NULL NULL NULL 6 6666
8 nul 20 NULL NULL

执行计划:
左表与右表均全量读取。
1
2
# sql 2.3
select * from person full join account on person.id=account.id and account.id=4;

执行结果:

同sql 2.2

执行计划:
同sql 2.2.
1
2
# sql 2.4
select * from person full join account on person.id=account.id where person.id=4;

执行结果:

person.id       person.name     person.age      account.id      account.account
4 tom 19 4 1212

执行计划:
左表与右表均全量读取。
1
2
# sql 2.5
select * from person full join account on person.id=account.id where account.id=4;

执行结果:

同sql 2.4

执行计划:
同sql 2.4.
1
2
# sql 2.6
select * from person full join account on person.id=account.id where person.id=4 and account.id=4;

执行结果:

同sql 2.4

执行计划:
同sql 2.4.