Right now, the open-source Apache Hadoop distribution is the most popular big data solution in the industry, providing distributed data storage and parallel processing capabilities. It helps enterprises process and analyse both structured and unstructured data to extract useful business insights. The Hadoop Ecosystem consists of many components, such as Pig, Hive and HBase, each with its own advantages for big data processing.

Though developed and distributed under an open-source license by the Apache Software Foundation, Hadoop and its components are independent of each other and often require manual installation and integration when building a cluster. One also has to keep the different component versions up to date as new releases appear. There is no commercial support available either, so your best bet when stuck is a Google search or a technical forum such as Stack Overflow. In short, setting up a Hadoop cluster with plain Apache Hadoop can be a little challenging.

Thankfully, a few enterprises have started building their own Hadoop distributions, which are available in both open-source and commercial versions. Each distribution implements some or all of the components of the Hadoop Ecosystem, depending on the project requirements. The vendors have hardened the open-source version with proprietary add-ons and custom support to provide a complete Hadoop package that simplifies the overall installation process. Some of the top Hadoop distribution vendors include Cloudera, Hortonworks, MapR, IBM, Intel, Microsoft, Amazon Web Services (AWS) and Pivotal Software. Their offerings vary: some provide the basic open-source software with support and training services, while others add customizations that ease cluster deployment, administration and operations.

The 2014 rankings of Hadoop distribution vendors released by Forrester Research place Cloudera, Hortonworks and MapR as the top three industry choices. A comparison of job postings on indeed.com for these distributions shows that Cloudera leads in demand, followed by Hortonworks and then MapR. Although all of these vendors build on the open-source Apache Hadoop version, their distributions differ in various ways. For example, Cloudera and MapR solutions target Linux, whereas Hortonworks also offers a distribution for Windows.

[Figure: job posting comparison on indeed.com for Cloudera, Hortonworks and MapR]

Cloudera’s Hadoop distribution, CDH4, includes HDFS, YARN, HBase, MapReduce, Hive, Pig, ZooKeeper, Oozie, Mahout, Hue, and other open source tools, including the real-time query engine Impala. Cloudera Manager Free Edition includes all of CDH, plus a basic Manager supporting up to 50 cluster nodes. Cloudera Enterprise combines CDH with a more sophisticated Manager supporting an unlimited number of cluster nodes, proactive monitoring, and additional data analysis tools.

The Hortonworks Hadoop distribution, HDP 2.0, includes HDFS, YARN, HBase, MapReduce, Hive, Pig, HCatalog, ZooKeeper, Oozie, Mahout, Hue, Ambari, Tez, a real-time version of Hive (Stinger), and other open source tools. It also provides high-availability support, a high-performance Hive ODBC driver, and Talend Open Studio for Big Data.

The MapR Hadoop distribution, M7, includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools. It also includes direct NFS access, snapshots, and mirroring for “high availability,” a proprietary HBase implementation that is fully compatible with Apache APIs, and a MapR management console.
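To make the overlap between these three component lists easier to see, here is a minimal Python sketch that compares them as sets. It only covers the components named in this article (the "other open source tools" each vendor ships are not captured), and the dictionary keys and the `unique_to` helper are illustrative names, not anything provided by the vendors.

```python
# Components as named in this article for each distribution.
# Assumption: "other open source tools" are unknown and therefore left out.
distributions = {
    "Cloudera CDH4": {
        "HDFS", "YARN", "HBase", "MapReduce", "Hive", "Pig",
        "ZooKeeper", "Oozie", "Mahout", "Hue", "Impala",
    },
    "Hortonworks HDP 2.0": {
        "HDFS", "YARN", "HBase", "MapReduce", "Hive", "Pig", "HCatalog",
        "ZooKeeper", "Oozie", "Mahout", "Hue", "Ambari", "Tez", "Stinger",
    },
    "MapR M7": {
        "HDFS", "HBase", "MapReduce", "Hive", "Mahout", "Oozie",
        "Pig", "ZooKeeper", "Hue",
    },
}

# Components that appear in every list.
common = set.intersection(*distributions.values())


def unique_to(name):
    """Return the components listed only for the given distribution."""
    others = set().union(
        *(parts for other, parts in distributions.items() if other != name)
    )
    return distributions[name] - others


print("Common to all three:", sorted(common))
for name in distributions:
    print(f"Only in {name}:", sorted(unique_to(name)))
```

Going by the lists in this article, the core stack (HDFS, HBase, MapReduce, Hive, Pig, ZooKeeper, Oozie, Mahout and Hue) is shared by all three, while Impala is named only for CDH4 and HCatalog, Ambari, Tez and Stinger only for HDP 2.0.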

Let's take a quick look at the key features of these three Hadoop distributions:

[Figure: key features of the Cloudera, Hortonworks and MapR Hadoop distributions, http://core0.staticworld.net/images/article/2014/06/hadoop-distib-100341967-large.idge.gif]

Image courtesy http://blog.pluralsight.com/top-3-hadoop-distributions

Today, Hadoop is not only an integral part of the big data ecosystem; it has also given birth to a whole new set of related tools. It will be interesting to see what new developments surface to make Hadoop even easier to use in the near future. Watch this space, and we will keep you updated!

Want to know how to deal with unstructured data in Hadoop? Take a look at Jigsaw Faculty Pavithrra’s article Hadoop And Unstructured Data.
