Introduction 

If you are not from the IT world, you probably have no idea what Hive is. For more than a decade, Apache Hive has served as a data warehouse for large data sets. As simple and appealing as the name is, Hive is a fairly simple piece of software too. Read on to understand what Hive is.

  1. Definition of Hive
  2. History of Hive
  3. What is Hive Hadoop?

1) Definition of Hive

Hive is data warehouse software used to query and process structured data. It is built on top of Hadoop. It is an open-source technology used to analyse large data sets.

2) History of Hive

Hive was originally developed at Facebook, which had been analysing its data with an Oracle database. As the user base and the volume of data grew, that setup became unmanageable, so the company built Hive. Facebook open-sourced Hive in 2008, and it became a top-level Apache Software Foundation project in 2010. Over time, Netflix, Amazon, Citi, Pinterest and many other large companies started using Hive.

What is Hive? Hive is an ETL (extract, transform, load) and data warehouse tool built to process structured data. It provides a SQL-like query language called HiveQL, or HQL. HiveQL should not be confused with Microsoft SQL Server, which is a separate relational database management system developed, operated and marketed by Microsoft.

It is designed for OLAP and is scalable and extensible. HiveQL, the Hive Query Language, is used to query structured data whose schema is described in the Metastore. Apache Hive offers a simplified query model that needs less code than hand-written MapReduce. On very large data sets, writing a query is usually much quicker than writing the equivalent MapReduce program, although Hive queries run as batch jobs and are not meant for low-latency lookups. Apache Hive can run on different execution frameworks, such as MapReduce, Apache Tez and Apache Spark. It supports ad hoc querying of data on HDFS.
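To make the comparison with MapReduce concrete, here is a minimal sketch in plain Python (not Hive or Hadoop code). It counts rows per department in a map, shuffle and reduce style, next to the one-line HiveQL query that expresses the same thing. The table name `employees`, the column `department` and the sample rows are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (employee_name, department).
ROWS = [
    ("asha", "sales"),
    ("ben", "engineering"),
    ("chen", "sales"),
    ("dina", "engineering"),
    ("eli", "sales"),
]

# Roughly what a user would type in Hive (table and column names are made up).
HIVEQL = "SELECT department, COUNT(*) FROM employees GROUP BY department"


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by department."""
    for _name, department in rows:
        yield department, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the per-department count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_department(rows):
    """Chain the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(HIVEQL)
    print(count_by_department(ROWS))
```

The point of the sketch is only the contrast in effort: the query is one line, while the hand-written version needs three functions even for a simple count.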

Apache Hive supports user-defined functions, scripts and custom I/O formats to extend its functionality. Mature JDBC and ODBC drivers allow many applications to pull Hive data for smooth, effortless reporting. Hive lets the user read data in arbitrary formats using SerDes and input/output formats. Hive has a well-defined architecture for metadata management, authentication and query optimization. There is a very large community of developers and users working with Hive.

Uses of Hive: Hive is used as an ETL tool to extract, transform and load data with the help of its data definition language and data manipulation language. Compared to MapReduce, Hive is easier to use, and two tables can be joined with minimal effort. Hive also supports other features, including partitioning, bucketing and user-defined functions (UDFs). It allows you to communicate with third-party applications through a Thrift server. Hive can connect to NoSQL databases such as HBase and MongoDB using storage handlers.
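Partitioning and bucketing are easier to picture with a small example. The sketch below is a toy model in Python, not Hive code. It assumes a partitioned table keeps each partition under a `column=value` subdirectory, and that a bucketed table assigns each row to a bucket by hashing the bucketing column modulo the bucket count. The paths, the column names and the use of CRC32 as the hash are illustrative assumptions; Hive uses its own hash function.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count chosen when the table is created


def partition_path(table_dir, partition_column, value):
    """Return the subdirectory that holds one partition's rows."""
    return f"{table_dir}/{partition_column}={value}"


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Pick a bucket by hashing the bucketing column, modulo the bucket count.

    CRC32 is a stand-in here; it is not the hash function Hive uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


if __name__ == "__main__":
    print(partition_path("/warehouse/sales", "country", "IN"))
    print(bucket_for("customer-42"))
```

Partitioning lets a query skip whole directories that cannot match a filter, while bucketing spreads rows evenly across a fixed number of files, which helps with sampling and joins.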

  • The Architecture of Hive

Hive consists of three core parts: 1) Hive clients, 2) Hive services and 3) Hive storage and computing.

Hive Clients – Hive provides various drivers for communication with a variety of applications.

Hive Services – All client interaction takes place through Hive services. The driver in Hive services is the main driver, and it communicates with JDBC, ODBC and other client-specific applications. The driver then passes the requests from these applications on to the Metastore and the file system for further processing.

Hive Storage and Computing – The metadata for tables created in Hive is stored in the Metastore database. Query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.

User Interfaces – The user interfaces that Hive supports are the Hive Web UI, the Hive command line and Hive HDInsight.

Metastore – Hive uses a database server to store the schema, or metadata, of tables, databases, columns in a table, their data types and the HDFS mapping.

HiveQL Process Engine – Rather than writing a MapReduce program in Java, we can write a HiveQL query, which is then processed as a MapReduce job.

Execution Engine – It processes the query and generates the same results a MapReduce job would.

HDFS – The Hadoop Distributed File System is the storage layer where the data is kept.
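The role of the Metastore in the components above can be sketched with a toy model. The Python below is not Hive's API. `ToyMetastore`, `TableMeta` and `plan_scan` are hypothetical names that only illustrate the idea: the Metastore maps a table name to its columns and its storage location, and a query is resolved against that metadata before any data is read.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    columns: dict   # column name -> data type
    location: str   # where the table's files live (an HDFS path in Hive)


@dataclass
class ToyMetastore:
    tables: dict = field(default_factory=dict)

    def register(self, name, columns, location):
        """Record the schema and storage location for a table."""
        self.tables[name] = TableMeta(columns, location)

    def lookup(self, name):
        """Return the stored metadata for a table."""
        return self.tables[name]


def plan_scan(metastore, table, wanted_columns):
    """Resolve a table through the metastore and check the columns exist."""
    meta = metastore.lookup(table)
    missing = [c for c in wanted_columns if c not in meta.columns]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return {"read_from": meta.location, "columns": list(wanted_columns)}


if __name__ == "__main__":
    ms = ToyMetastore()
    ms.register("employees", {"name": "string", "salary": "int"}, "/warehouse/employees")
    print(plan_scan(ms, "employees", ["name"]))
```

Keeping the schema separate from the files is what lets Hive treat plain files on HDFS as tables.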

3) What is Hive Hadoop?

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Two of its core modules are MapReduce and the Hadoop Distributed File System (HDFS).

It enables us to run applications on clusters of thousands of nodes handling thousands of terabytes of data. It reduces the risk of failure and provides a computing solution that is scalable, flexible, cost-effective and fault-tolerant.

Hive Table – There are two types of Hive tables: 1) Managed Table and 2) External Table.

Managed Table – Here, Hive assumes that it owns the data. Both the data and the metadata are deleted if a managed table is dropped.

For managed tables, the data is stored by Hive in its own warehouse directory.

Managed tables support ACID transactions. Statements such as Archive, Unarchive, Merge, Concatenate and Truncate are supported.

Query result caching is supported in a managed table.

External Table – Here, Hive assumes that it does not own the data.

Dropping the table does not delete the data, but the metadata for that particular table is deleted.

Here, the data is stored in the location specified during the creation of the table and not in Hive's warehouse directory.

ACID transactions are not supported.

Statements such as Archive, Unarchive, Concatenate and Merge are not supported.

Query result caching is also not supported.
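The difference in what happens when a table is dropped can be shown with a small simulation. This is a toy Python model, not Hive code. `ToyWarehouse` and the paths are hypothetical, and a dictionary stands in for files on HDFS.

```python
class ToyWarehouse:
    """Tracks table metadata and data separately, as Hive does."""

    def __init__(self):
        self.metadata = {}  # table name -> {"location": ..., "external": ...}
        self.files = {}     # location -> rows, standing in for files on HDFS

    def create_table(self, name, location, external=False):
        """Register a table and make sure its storage location exists."""
        self.metadata[name] = {"location": location, "external": external}
        self.files.setdefault(location, [])

    def drop_table(self, name):
        """Remove the metadata; remove the data too only for managed tables."""
        meta = self.metadata.pop(name)
        if not meta["external"]:
            self.files.pop(meta["location"], None)


if __name__ == "__main__":
    wh = ToyWarehouse()
    wh.create_table("managed_sales", "/warehouse/managed_sales")
    wh.create_table("external_logs", "/data/logs", external=True)
    wh.drop_table("managed_sales")
    wh.drop_table("external_logs")
    print(sorted(wh.files))  # only the external location is still there
```

In practice, this is why external tables are the safer choice when other tools also read or write the same files.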

Limitations of Hive – Hive is not suitable for online transaction processing (OLTP); it is meant for online analytical processing (OLAP). Hive lets you overwrite or append data, but row-level updates and deletes are not available on ordinary tables; newer versions allow them only on transactional (ACID) tables. Subquery support is also more limited than in a typical relational database.

Conclusion

Since its release, Hive has attracted a large user base, including companies such as Netflix and Amazon. Its features, and the fact that it is free, open-source software, make it a strong choice for data teams across the globe.

With the above article, we hope that we were able to give you a brief idea about what Hive is, what its functions are and how it works.

If you are interested in making it big in the world of data and evolving as a future leader, you may consider our Integrated Program in Business Analytics, a 10-month online program in collaboration with IIM Indore!