Several SQL-on-Hadoop implementations are now available for big data processing. While working on a large data problem, we decided to evaluate Impala and Presto to see what they offer.
Both implementations bypass MapReduce (avoiding much of its associated I/O) and perform distributed processing directly on the underlying storage. There are, however, differences arising from inherent design choices in the two systems, which we explore in this article.
Impala – As per Cloudera “Impala is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop – combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise”.
We used Impala on Amazon EMR for research.
Presto – Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
We used Qubole’s cloud-based Presto as a Service for performance evaluation and also built our own Presto cluster to understand its various components and their installation and configuration.
Impala – It is very easy to set up and configure an Impala cluster. Web interfaces are available to view the cluster, individual nodes’ resource usage, configuration, query profiles, etc.
Issue – we observed that the cluster sometimes becomes corrupted after restarting a node.
Presto – It is also very easy to set up and configure a Presto cluster, and it too provides web interfaces to view cluster information.
Impala – It supports HDFS caching, which is available in the latest version of CDH, but in our tests it did not improve performance. After some research and checking the forums, we found that this is a recently added feature with an implementation issue (in the way Impala interacts with the HDFS cache); it is expected to be fixed in the next version.
Presto – Presto does not provide caching by default, but the Qubole distribution adds on-disk file caching.
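The idea behind on-disk file caching can be sketched as follows. This is a hypothetical illustration of the concept, not Qubole’s actual implementation: the first read of a file fetches it from remote storage (e.g. S3) and persists it locally, so repeated scans of the same file are served from local disk.

```python
import os
import tempfile

# Hypothetical sketch of an on-disk file cache: cache remote-file reads on
# local disk so repeated scans of the same file avoid a second remote fetch.
class DiskCache:
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.remote_fetches = 0  # count of "remote" reads, for illustration

    def _local_path(self, key):
        # A real cache would hash the key; a flat name keeps the sketch simple.
        return os.path.join(self.cache_dir, key.replace("/", "_"))

    def read(self, key, fetch_remote):
        path = self._local_path(key)
        if os.path.exists(path):        # cache hit: serve from local disk
            with open(path, "rb") as f:
                return f.read()
        data = fetch_remote(key)        # cache miss: fetch, then persist locally
        self.remote_fetches += 1
        with open(path, "wb") as f:
            f.write(data)
        return data

cache = DiskCache(tempfile.mkdtemp())
fake_remote = lambda key: b"row-data-for-" + key.encode()  # stand-in for S3/HDFS
first = cache.read("warehouse/part-0001", fake_remote)
second = cache.read("warehouse/part-0001", fake_remote)    # served from local disk
print(cache.remote_fetches)  # 1 -- the remote store was contacted only once
```

The same file is read twice, but remote storage is touched only once; for S3-backed tables this is exactly the round trip the cache is meant to save.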
Impala – It uses HDFS for storage and supports the Text and Parquet storage formats. Parquet is a columnar format and stores data compressed. Parquet gives comparatively better performance (timings) because of its compression and columnar layout.
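The advantage of a columnar layout like Parquet for analytic queries can be illustrated with a toy example (data and column names are made up): a column scan needs only the bytes of one column, whereas a row layout forces the engine to walk every field of every row, and columns of similar values also compress better together.

```python
# Toy illustration of row vs. columnar layout for SELECT SUM(spend).
rows = [
    {"user_id": 1, "country": "US", "spend": 120.0},
    {"user_id": 2, "country": "IN", "spend": 80.0},
    {"user_id": 3, "country": "US", "spend": 45.5},
]

# Row-oriented: even a single-column aggregate walks whole rows.
row_sum = sum(r["spend"] for r in rows)

# Column-oriented: the same table pivoted into one array per column; the
# query reads just the "spend" array, and each column can be compressed
# independently (runs of similar values compress well together).
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_sum = sum(columns["spend"])

print(row_sum == col_sum)  # True -- same answer, far fewer bytes scanned
```

Both layouts return the same result; the columnar one simply never touches the `user_id` and `country` bytes, which is where the timing gains come from.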
Presto – Presto can work with various data file formats, such as ORC and RCFile, that are typically stored on HDFS or in Amazon S3, accessed through the Hive metastore service.
It can also connect to other data sources, such as MySQL and Cassandra.
In the case of Presto, we evaluated various file formats stored on S3.
Impala – An Impala agent runs on each data node, executes the query locally (on local data), and returns the result to the coordinating node where the user submitted the query. So as we add more nodes, performance increases because execution is distributed.
Since Impala can load table data into memory, data nodes with high-memory configurations are advised for better performance.
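This scatter-gather pattern can be sketched in a few lines (the partitions and the aggregate are illustrative, not Impala internals): each node aggregates only its local partition, and the coordinating node merges the partial results.

```python
# Sketch of Impala-style local execution for SELECT SUM(x):
# each "node" aggregates its own partition, the coordinator merges.
def node_local_sum(partition):
    # Runs on a data node, against locally stored rows only.
    return sum(partition)

# Table data as it might be spread across three data nodes.
partitions = [[10, 20], [30, 40], [50]]

partial_results = [node_local_sum(p) for p in partitions]  # one result per node
total = sum(partial_results)                               # merged on the coordinator
print(total)  # 150
```

Adding a node adds another partial aggregation running in parallel, which is why throughput scales with cluster size for this kind of query.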
Presto – The main components involved in query execution are the coordinator and the workers.
- The coordinator’s main responsibilities include SQL parsing, query planning, and execution scheduling. Presto prepares a distributed query plan in the form of a DAG of stages.
- Workers are the components that actually execute the assigned tasks.
Overall, Presto converts the SQL query into a group of stages, tasks, and drivers, where the workers execute the tasks in parallel and each task operates only on a small part of the data (a data chunk). As we increase the number of worker nodes, performance is expected to increase, subject to the problem in question.
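The chunked, parallel execution described above can be sketched with a thread pool standing in for the worker fleet (chunk size, data, and the aggregate are illustrative): the scan is divided into small chunks, tasks run over chunks in parallel, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of Presto-style parallelism: split the scan into small chunks,
# have workers process chunks in parallel, then combine partial results.
data = list(range(1, 101))
chunk_size = 10
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def run_task(chunk):
    # A task operates on only one small chunk of the data.
    return sum(chunk)

# The thread pool stands in for a set of worker nodes.
with ThreadPoolExecutor(max_workers=4) as workers:
    partials = list(workers.map(run_task, chunks))

print(sum(partials))  # 5050 -- same as summing the whole table at once
```

Because each task touches only its own chunk, adding workers lets more chunks be processed concurrently, which matches the expected (workload-dependent) scaling noted above.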