hive.auto.convert.join.noconditionaltask = true;
Added in Hive 0.11.0, and it is true by default. That means, if the sum of size for n-1 of the tables/partitions for an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size(10MB by default), the join is directly converted to a mapjoin (there is no conditional task).The size configuration enables the user to control what size table can fit in memory. This value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory.
hive.auto.convert.join.noconditionaltask determines whether Hive enable the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
Added in Hive 0.8.0, and it is 25MB by default which means, if the input file size is smaller than this threshold, Hive will try to convert the common join into map join. If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin(there is no conditional task). The default is 10MB.
These three parameters, jointly enable:
- Loading the small table into local memory (where Hive client is running)
- Creates hash table for that small table in local memory
- Push the hash table into HDFS.
- All the mappers load this hash table from the HDFS and does map-side join
If all but one of the tables being joined are small, the join can be performed as a map only job. The query
select /*+ MAPJOIN(b) */ a.key, a.value
from a join b ON a.key = b.key
does not need a reducer. For every mapper of A, B is read completely.
Consider the possibility of multiple mapjoins on different keys:
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)
JOIN smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)
The above query is not supported. Without the mapjoin hint, the above query would be executed as 2 map-only jobs. If the user knows in advance that the inputs are small enough to fit in memory, the following configurable parameters can be used to make sure that the query executes in a single map-reduce job.