Inner join、Outer join、Full join中on与where的区别

背景

本文主要讨论hive中，不同join方式下on条件和where条件的区别，同时关注hive中如何执行语句。比如谓词下推，就是其中一种优化技术。

原表

person表

person.id       person.name     person.age
3       mili    21
4       tom     19
5       mike    18
8       nul     20

account表

account.id      account.account
1       2222
2       2323
3       1111
4       1212
5       5555
6       6666

Inner join

通常情况下简写为join
inner join，查询条件在on中和where中，执行结果上并没有不同。
比如，下面几个查询语句执行结果相同

select * from person join account on person.id=account.id and person.id=4;
select * from person join account on person.id=account.id and account.id=4;
select * from person join account on person.id=account.id where person.id=4;
select * from person join account on person.id=account.id where account.id=4;
select * from person join account on person.id=account.id and person.id=4 and account.id=4;
select * from person join account on person.id=account.id where person.id=4 and account.id=4;

执行结果：

person.id       person.name     person.age      account.id      account.account
4       tom     19      4       1212

通过explain查看上述执行语句转换的mapreduce任务，所有的任务都被转化为相同的执行计划：

hive (test)> explain select * from person join account on person.id=account.id and person.id=4 and account.id=4;
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        person
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        person
          TableScan
            alias: person
            Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id = 4) (type: boolean)
              Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 4 (type: int)
                  1 4 (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: account
            Statistics: Num rows: 5 Data size: 42 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id = 4) (type: boolean)
              Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 4 (type: int)
                  1 4 (type: int)
                outputColumnNames: _col1, _col2, _col7
                Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: 4 (type: int), _col1 (type: string), _col2 (type: int), 4 (type: int), _col7 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.253 seconds, Fetched: 62 row(s)

不管是account表还是person表，TableScan中均包含“predicate: (id = 4)”这条查询条件，这就意味着读取2个表的数据是先根据条件筛选数据，然后再做join操作。
总结：inner join中筛选条件在on或where中，没有差异，因为hive在底层做了逻辑方面的优化，不同的语句被转化为相同的执行任务。

Outer join

以left outer join为例, 通常简写为left join. on条件不会影响左表返回的结果，仅影响右表，where条件会影响左表返回结果。

1 2	# sql 1.1 select * from person left join account on person.id=account.id and person.id=4;

执行结果：

person.id       person.name     person.age      account.id      account.account
3       mili    21      NULL    NULL
4       tom     19      4       1212
5       mike    18      NULL    NULL
8       nul     20      NULL    NULL

执行计划：
无论左表还是右表，均全量读取。

1 2	# sql 1.2 select * from person left join account on person.id=account.id and account.id=4;

执行结果：

同sql 1.1

执行计划：

hive (test)> explain select * from person left join account on person.id=account.id and account.id=4;
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        account
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        account
          TableScan
            alias: account
            Statistics: Num rows: 5 Data size: 42 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id = 4) (type: boolean)
              Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: person
            Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col6, _col7
              Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4
                Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.349 seconds, Fetched: 59 row(s)

左表全量读取，右表仅读取account.id=4的数据。

1 2	# sql 1.3 select * from person left join account on person.id=account.id and person.id=4 and account.id=4;

执行结果：

同sql 1.2

执行计划：
同sql 1.2。
等价于person.id加不加都不影响结果和执行过程。

1 2	# sql 1.4 select * from person left join account on person.id=account.id where person.id=4;

执行结果：

person.id       person.name     person.age      account.id      account.account
4       tom     19      4       1212

执行计划：

hive (test)> explain select * from person left join account on person.id=account.id where person.id=4;
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        account
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        account
          TableScan
            alias: account
            Statistics: Num rows: 5 Data size: 42 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id = 4) (type: boolean)
              Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: person
            Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id = 4) (type: boolean)
              Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 id (type: int)
                  1 id (type: int)
                outputColumnNames: _col1, _col2, _col6, _col7
                Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: 4 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.228 seconds, Fetched: 62 row(s)

执行计划表明，无论是左表还是右表，均在读取数据时做了筛选，然后执行join。

1 2	# sql 1.5 select * from person left join account on person.id=account.id where account.id=4;

执行结果：

同sql 1.4

执行计划：
无论左表还是右表，均全量读取。

1 2	# sql 1.6 select * from person left join account on person.id=account.id where person.id=4 and account.id=4;

执行结果：

同sql 1.4

执行计划：
同sql1.4。

Full join

1 2	# sql 2.1 select * from person full join account on person.id=account.id;

执行结果：

person.id       person.name     person.age      account.id      account.account
NULL    NULL    NULL    1       2222
NULL    NULL    NULL    2       2323
3       mili    21      3       1111
4       tom     19      4       1212
5       mike    18      5       5555
NULL    NULL    NULL    6       6666
8       nul     20      NULL    NULL

执行计划：
左表与右表均全量读取。

1 2	# sql 2.2 select * from person full join account on person.id=account.id and person.id=4;

执行结果：

person.id       person.name     person.age      account.id      account.account
NULL    NULL    NULL    1       2222
NULL    NULL    NULL    2       2323
3       mili    21      NULL    NULL
NULL    NULL    NULL    3       1111
4       tom     19      4       1212
5       mike    18      NULL    NULL
NULL    NULL    NULL    5       5555
NULL    NULL    NULL    6       6666
8       nul     20      NULL    NULL

执行计划：
左表与右表均全量读取。

1 2	# sql 2.3 select * from person full join account on person.id=account.id and account.id=4;

执行结果：

同sql 2.2

执行计划：
同sql 2.2.

1 2	# sql 2.4 select * from person full join account on person.id=account.id where person.id=4;

执行结果：

person.id       person.name     person.age      account.id      account.account
4       tom     19      4       1212

执行计划：
左表与右表均全量读取。

1 2	# sql 2.5 select * from person full join account on person.id=account.id where account.id=4;

执行结果：

同sql 2.4

执行计划：
同sql 2.4.

1 2	# sql 2.6 select * from person full join account on person.id=account.id where person.id=4 and account.id=4;

执行结果：

同sql 2.4

执行计划：
同sql 2.4.