Is your Hive Query really optimized? Is it a contender for parallel execution?

Whether you have been working on traditional SQL based solutions for years or an individual who has just started his/her career, while learning HIVE, writing optimized queries is an art must learn. Here in this blog, we share about one such query which we worked upon recently and helps you understand how rewriting a query can really leverage HIVE capabilities for multiple mapreduce jobs running in parallel.

The scenario: 4 Tables, required to be joined to get us selected attribute/s from a table.

The simple form for query that is natural to think about is:

SELECT
table1.a
FROM
table1 JOIN table2 ON (table1.a =table2.a )
join table3 ON (table3.a=table1.a)
join table4 ON (table4.b=table3.b);

Output:
⦁ Works fine as expected!
⦁ Execution time : 800 secs

But let us check the execution plan for this:

serially

picture 1

NOTE: Only a clipping of the total execution plan output is shown here for easy understanding.

Some observations (see picture 1 highlighted area):

  1.  Total Map-Reduce Jobs: 2.
  2. Serially Launched & Run.

Thus, as we all know that all HIVE queries are inbuilt translated to a number of map reduce jobs, but having multiple Map-reduce jobs is not enough, real advantage is of their parallel execution and as noted above simply writing a query does not achieve this.

To achieve this, we thought about query re-writing in a way to segregate the query into independent units which HIVE could work upon as independent map reduce jobs running parallely. Following is what we did to our query:

SELECT
r1.a
FROM
(SELECT table1.a FROM table1 JOIN table2 ON table1.a =table2.a ) r1
JOIN
(SELECT table3.a FROM table3 JOIN table4 ON table3.b =table4.b ) r2
ON (r1.a =r2.a) ;

Output:

⦁    Same results.

⦁    Execution time: 464 secs

parrallel1

picture 2

parrallel2

picture 3

NOTE: Only a clipping of the total execution plan output is shown here for easy understanding.

Some observations:

  1. Total Map-Reduce Jobs: 5 (see picture 2 highlighted area).
  2.  Jobs are parallely Launched & Run. (see picture 3 highlighted area).
  3.  This is more clearly visible while working with large data sets.

Advantage:

Substantial decrease in query execution time (around 50% as noted in our case)

Points to Note:

⦁    We need to check HIVE for hive.exec.parallel configuration parameter to be set to value TRUE.

⦁    Further, to control how many jobs at most can be executed in parallel we can set hive.exec.parallel.thread.number configuration parameter.

Leave a Reply

Your email address will not be published. Required fields are marked *


+ 9 = thirteen

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>