Microsoft’s Big Data Initiative with Apache Hadoop, Hive and Pig

Big data

Big data is one of the most discussed topics in the technology world today. Before looking at the opportunities it creates and at Microsoft’s offerings, it helps to understand what big data is.

Big data is data that is hard to manage with conventional resources, such as a single machine or a relational database. It is typically unstructured and ever growing, sometimes at a very high rate.

The opportunities in big data can be well understood from the following statement:

“By 2015, organizations integrating high-value, diverse, new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20%.”

Information Management in the 21st Century

Gartner, September 2011

Opportunities and Microsoft’s position

Over the last few years, companies like Google and Amazon have led the web through their ability to analyze data. Google’s ad products and Amazon’s online shopping site are good examples. Organizations like the Apache Software Foundation and LexisNexis Risk Solutions have offered big data solutions such as Hadoop and HPCC for large-scale data analysis and processing. But all of those solutions were built for Linux. As a result, big data analysis was largely out of reach for companies running Windows and developing with Microsoft tools.

Microsoft Hadoop

Microsoft broke the ice last year with its first preview of Apache Hadoop on Azure, which offered a six-machine Hadoop cluster, later cut down to two. The next move everyone was waiting for was Hadoop on Windows Server, named HDInsight. Microsoft released a Windows installer for Hadoop on Windows, based on Apache Hadoop version 1.1. Microsoft’s Hadoop platform is built on the Hortonworks Data Platform. With this launch, users get Hadoop’s power to store and analyze data combined with the simplicity of Windows.

Microsoft also provides the HDInsight Service on Azure, which adds on-demand scalability and high cluster availability. Another advantage is high-speed connectivity between nodes, which lets high-performance Map-Reduce jobs run over HDFS data. A major advantage of Hadoop on Azure is its simple web interface for managing nodes and allocating cluster nodes on demand.

Microsoft’s offerings with Hadoop for Windows and Azure

Microsoft’s Hadoop is 100% compatible with Apache Hadoop, so it has the same basic components:

  • HDFS: Apache Hadoop’s distributed storage, a file system that runs across a cluster, analogous to the Google File System (GFS).
  • Apache Map-Reduce: For distributed data processing.
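
To make the Map-Reduce model concrete, here is a minimal single-machine sketch of the classic word count in Python. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen only for illustration. On a real cluster the framework runs the map and reduce steps on many nodes in parallel and does the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (the framework does this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for files stored in HDFS.
    sample = ["big data on Hadoop", "Hadoop on Windows"]
    print(word_count(sample))
```

The point of the split is that each mapper only needs its own slice of the input and each reducer only needs the values for its own keys, which is what lets the work spread across a cluster.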

It also supports sister projects for data analysis in Hadoop:

  • Apache Hive: Data warehouse for querying and analysis.
  • Apache Pig: High-level platform for creating Map-Reduce jobs.
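
The appeal of these higher-level tools is that you state what you want and let the engine work out the processing steps. The Python sketch below illustrates that idea only. It uses SQLite from the standard library as a local stand-in so that it runs without a cluster; it is not Hive, and the visits table and its rows are made up for illustration. Hive accepts SQL-like queries of a similar shape and turns them into Map-Reduce jobs over HDFS data.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, visitor) rows standing in for a large log table.
VISITS = [("home", "a"), ("docs", "b"), ("home", "c"), ("home", "d"), ("docs", "e")]


def count_with_query(rows):
    """Declarative version: describe the result and let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (page TEXT, visitor TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page, COUNT(*) FROM visits GROUP BY page"
    ).fetchall()
    conn.close()
    return dict(result)


def count_by_hand(rows):
    """Hand-written equivalent: treat each row as (page, 1) and sum per key."""
    totals = defaultdict(int)
    for page, _visitor in rows:
        totals[page] += 1
    return dict(totals)


if __name__ == "__main__":
    print(count_with_query(VISITS))
    print(count_by_hand(VISITS))
```

Both functions return the same counts per page. The query version is one line of intent, while the hand-written version spells out the grouping logic that a Map-Reduce job would otherwise have to implement.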

Other major features offered:

  • Interactive console: JavaScript-enabled front end for Hadoop commands, with web-based support for running Map-Reduce jobs.
  • Hive ODBC driver and Hive Add-in: Connectivity to Microsoft Office Excel 2013 and Business Intelligence tools.
  • .NET SDK for Hadoop.
  • Web interface for monitoring job progress and managing job history.

What is Microsoft betting on when the same service is offered by Apache under an open source license?

Microsoft differentiates its platform from other open source solutions available in the market (Linux-based Apache Hadoop, HBase, Hypertable, etc.) in the following ways:

  • Support for the Windows platform, giving companies that use Microsoft platforms an opportunity in big data
  • Simple management and simple environment
  • Easy connection with various data sources, e.g. government census data and social media
  • Connectivity drivers for various Microsoft tools such as Microsoft Excel
  • Hadoop .NET SDK, enabling users to develop Map-Reduce jobs in .NET languages
  • HDInsight Services over Azure that ensure high availability of the Hadoop cluster, making it a good platform for enterprise application development
  • HDInsight Services over Azure that provide high-speed connectivity between data nodes, enabling high-performance Map-Reduce jobs

This is a summary of everything Microsoft is offering with Hadoop for big data. In another post I will cover the technical details of these offerings, my analysis of this move, and what we can expect in business terms from the Microsoft-Hortonworks partnership.