INTRODUCTION

Hive and Impala are tools that let you run SQL queries on data residing in HDFS or HBase. They are easy to learn for anyone with experience writing SQL. Hive uses HiveQL, which it translates into MapReduce or Spark jobs that run on a Hadoop cluster, while Impala uses its own specialized, fast SQL engine.

In this article, we will cover the following topics:

  1. What is Hive Metastore?
  2. Why use Hive Metastore and Impala Metastore?
  3. Defining Databases and Tables
  4. Data Types in Hive
  5. HCatalog and Its Uses
  6. Impala on Cluster
  7. Summary

1. What is Hive Metastore?

Hive Metastore is the component of Hive that stores the system catalog: metadata about Hive tables, columns, and partitions. This metadata is usually stored in a traditional RDBMS.

By default, Apache Hive uses an embedded Derby database to store this metadata. Any JDBC-compliant (Java Database Connectivity) database, such as MySQL, can be used as the Hive Metastore backend instead.

Several key attributes must be configured for the Hive Metastore. These include:

· Connection URL

· Connection driver

· Connection user ID

· Connection password
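In a typical Hive installation, these attributes correspond to properties in hive-site.xml. A minimal sketch, assuming a hypothetical MySQL backend on a host named metastore-host:

```xml
<configuration>
  <!-- JDBC URL of the metastore database (hypothetical host and database name) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore</value>
  </property>
  <!-- JDBC driver class for the chosen backend -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Credentials used by Hive to connect to the metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```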

2. Why use Hive Metastore and Impala Metastore?

Both the Hive server and the Impala server use the Metastore to determine the data location and table structure. When a query arrives, the server first consults the Hive Metastore for the table's structure and the location of its data, and then queries the actual table data stored in HDFS.

By default, all table data is stored under /user/hive/warehouse. Each table corresponds to a directory in this default location and contains one or more files.
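On a running cluster, this layout can be inspected with the HDFS command-line tool. A sketch, using a hypothetical table named customers:

```shell
# List the default warehouse directory; each table appears as a subdirectory
hdfs dfs -ls /user/hive/warehouse

# List the data files backing the hypothetical "customers" table
hdfs dfs -ls /user/hive/warehouse/customers
```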

3. Defining Databases and Tables

Here we will look at how to create and delete databases. Databases in Hive are managed using the data definition language (DDL) of HiveQL or Impala SQL.

  • To create a database, write CREATE DATABASE databasename;
  • To avoid an error if the database already exists, write CREATE DATABASE IF NOT EXISTS databasename;
  • Removing a database works the same way, with CREATE replaced by DROP: DROP DATABASE databasename;
  • To avoid an error if the database does not exist, write DROP DATABASE IF EXISTS databasename;
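The statements above can be run from the Hive shell or impala-shell. A minimal sketch, using a hypothetical database name salesdb:

```sql
-- Create the database; the IF NOT EXISTS clause suppresses
-- the error when it already exists
CREATE DATABASE IF NOT EXISTS salesdb;

-- Make it the current database
USE salesdb;

-- Drop it; the IF EXISTS clause suppresses the error when it is absent
DROP DATABASE IF EXISTS salesdb;
```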

4. Data Types in Hive

Here you will learn about the various data types in Hive. They fall into three categories: primitive, complex, and user-defined.

  • Primitive data types include the integer types TINYINT, SMALLINT, INT, and BIGINT
  • Complex data types include ARRAY, MAP, and STRUCT
  • User-defined data types are structures built from attributes of any type
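As an illustration, a table mixing primitive and complex types might be defined as follows (the table and column names are hypothetical):

```sql
CREATE TABLE IF NOT EXISTS employees (
  id      INT,                            -- primitive
  name    STRING,                         -- primitive
  salary  BIGINT,                         -- primitive
  skills  ARRAY<STRING>,                  -- complex: ordered list of values
  phones  MAP<STRING, STRING>,            -- complex: key/value pairs
  address STRUCT<city:STRING, zip:INT>    -- complex: named fields
);
```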

Table data is stored in the default warehouse location unless you specify otherwise. By default, tables are managed (internal), meaning the table's data is deleted when the table is dropped.

In both Hive and Impala, the schema is applied to the data when it is read from its stored location, rather than when the data is written. This is why Hive and Impala are described as schema-on-read systems.
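One way to see both the managed/external distinction and schema-on-read behavior is to create an external table over files that already exist in HDFS (the path and names here are hypothetical). The schema is applied only when the files are queried, and dropping the table leaves the files in place:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';  -- the files remain here even if the table is dropped
```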

5. HCatalog and Its Uses

In this topic, you will learn about HCatalog and its uses. HCatalog is a Hive sub-project that provides access to the Hive Metastore.

  • It lets users define tables using HiveQL DDL syntax
  • It is accessible through a command-line tool and a REST API
  • Tables created through HCatalog are accessible from Hive, Impala, and MapReduce
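As a sketch, the HCatalog command-line tool accepts the same DDL; the table and column names below are hypothetical:

```shell
# Define a table through HCatalog; it becomes visible to Hive,
# Impala, and MapReduce via the shared Metastore
hcat -e "CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING);"
```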

6. Impala on Cluster

Along with the NameNode, the master node runs two Impala daemons. The Catalog daemon relays metadata changes to every Impala daemon in the cluster. The StateStore daemon provides a lookup service for the Impala daemons and periodically checks their status.
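The Catalog daemon propagates changes made through Impala automatically, but not changes made outside it, so a common pattern is to reload metadata manually after creating or loading a table through Hive. A sketch, with a hypothetical database and table name:

```sql
-- Pick up a table created outside Impala (e.g., via Hive)
INVALIDATE METADATA salesdb.orders;

-- Pick up new data files added to an existing table
REFRESH salesdb.orders;
```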

7. Summary

In this Hive tutorial, we have learned that every table maps to a directory in HDFS, and that the Hive Metastore stores the data about the data in an RDBMS. Tables are created and managed using HiveQL DDL or Impala SQL. Sqoop supports importing data into Hive and Impala from an RDBMS, and HCatalog provides access to the Hive Metastore from tools outside Hive and Impala.

CONCLUSION

Hadoop has grown and developed since it was introduced. Facebook introduced Apache Hive to manage and process large data sets held in distributed storage in Hadoop.

Cloudera Impala was developed to address the limitations of slow SQL interaction with Hadoop.

Cloudera Impala and Apache Hive are competitors vying for acceptance in the database query space, and the debate over which of the two is best continues.

If you are interested in making a career in the Data Science domain, our 11-month in-person Postgraduate Certificate Diploma in Data Science course can help you immensely in becoming a successful Data Science professional. 
