Introduction

Big Data interviews may be general or may concentrate on a specific system or method. This article focuses on Apache Hive, one of the most frequently used Big Data tools. Going through this article will give you a detailed understanding of the Apache Hive questions employers ask in Big Data interviews.

Hadoop is an open-source framework designed to facilitate the storing and processing of large volumes of data. Hive is a data warehouse tool that works in the Hadoop ecosystem to process and summarize the data, making it easier to use. Now that you know what Hive is in the Hadoop ecosystem, read on to find out the most common Hive interview questions.

Hive Interview Questions

  1. What applications are supported by Hive?
  2. What are the different tables available in Hive?
  3. What is the difference between external and managed tables?
  4. Where does the data of a Hive table get stored?
  5. Can Hive be used in OLTP systems?
  6. Can a table name be changed in Hive?
  7. Where is Hive table data stored?
  8. Can the default location of a managed table be changed in Hive?
  9. What is a Hive Metastore?
  10. What are the types of metastores?
  11. What is the difference between Local and Remote metastores?
  12. What is the default Apache Hive metastore database?
  13. Can multiple users use one metastore?
  14. What are the three different modes in which Hive can be operated?
  15. Is there a data type in Hive to store date information?
  16. Why is partitioning used in Hive?
  17. What are the Hive collection data types?
  18. Is it possible to run UNIX shell commands in Hive?
  19. Is it possible to execute Hive queries from a script file?
  20. What is a .hiverc file?
  21. How can you check if a specific partition exists?
  22. If you had to list all databases that began with the letter ‘c’, how would you do it?
  23. Is it possible to delete DBPROPERTY in Hive?
  24. Which Java class handles the input record encoding into files that store Hive tables?
  25. Which Java class handles output record encoding into Hive query files?
  26. When a Hive table partition is pointed to a new directory, what happens to the data?
  27. Do you save space in the HDFS by archiving Hive tables?
  28. How can you stop a partition from being accessed in a query?
  29. What is a table generating function on Hive?
  30. Can you avoid MapReduce on Hive?
  31. Can a Cartesian join be created between two Hive tables?
  32. What is a view in Hive?
  33. Can the name of a view be the same as a Hive table name?
  34. Can we use the LOAD or INSERT command with a view?
  35. What is indexing in Hive?
  36. Are multi-line comments supported by Hive?
  37. How can you view the indexes of a Hive table?
  38. What is the Hive ObjectInspector function?
  39. What is bucketing?
  40. How is bucketing helpful?
  41. Can you specify the name of the table creator in Hive?
  42. What is HCatalog?
  43. What is UDF in Hive?
  44. What does /*+ STREAMTABLE(table_name) */ do?
  45. What are the limitations of Hive?
  46. Why do you need HCatalog?
  47. Name the components of a Hive query processor?
  48. How does Hive distribute rows into buckets?
  49. How do ORC format tables help Hive to enhance the performance?
  50. What are the different components of a Hive architecture?

Hive Interview Questions for 2021

Here is the comprehensive list of the most commonly asked Hive interview questions. Interview questions on Hive may be direct or application-based.

1. What applications are supported by Hive?

Hive supports client applications written in Java, PHP, Python, C++, and Ruby.

2. What are the different tables available in Hive?

There are two types of tables available in Hive – managed and external.

3. What is the difference between external and managed tables?

With managed tables, Hive controls both the schema and the data: dropping the table deletes the underlying data. With external tables, Hive manages only the schema; the data stays in its original location and is not deleted when the table is dropped.

4. Where does the data of a Hive table get stored?

Hive table data is stored in an HDFS directory – /user/hive/warehouse by default. You can change this by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml.
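For instance, the property can be overridden in hive-site.xml (the path shown is only an illustrative value):

```xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/data/hive/warehouse</value>
  <description>Base location for Hive managed tables</description>
</property>
```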

5. Can Hive be used in OLTP systems?

Since Hive does not support row-level data insertion, it is not suitable for use in OLTP systems.

6. Can a table name be changed in Hive?

Yes, you can change a table name in Hive. You can rename a table by using: ALTER TABLE table_name RENAME TO new_name;

7. Where is Hive table data stored?

Hive table data is stored in an HDFS directory by default – /user/hive/warehouse. This can be altered.

8. Can the default location of a managed table be changed in Hive?

Yes, the default managed table location can be changed in Hive by using the LOCATION ‘<hdfs_path>’ clause.
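A minimal sketch of the clause at table-creation time (the table name and HDFS path are hypothetical):

```sql
-- Managed table whose data lives outside the default warehouse directory
CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/data/custom/sales';
```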

9. What is a Hive Metastore?

A Metastore is a relational database that stores the metadata of Hive partitions, tables, databases, and so on.

10. What are the types of metastores?

Local and Remote metastores are the two types of Hive metastores.

11. What is the difference between Local and Remote metastores?

A local metastore runs in the same Java Virtual Machine (JVM) as the Hive service, whereas a remote metastore runs in its own separate JVM.

12. What is the default Apache Hive metastore database?

The default database for metastore is the embedded Derby database provided by Hive which is backed by the local disk.

13. Can multiple users use one metastore?

Not with the default embedded Derby database, which supports only one active user at a time. A metastore backed by a standalone database such as MySQL can be shared by multiple users.

14. What are the three different modes in which Hive can be operated?

The three modes in which Hive can be operated are Local mode, distributed mode, and pseudo-distributed mode.

15. Is there a data type in Hive to store date information?

Yes. The TIMESTAMP data type in Hive stores date and time information in the java.sql.Timestamp format, and Hive also provides a DATE type for calendar dates.

16. Why is partitioning used in Hive?

Partitioning is used in Hive as it allows for the reduction of query latency. Instead of scanning entire tables, only relevant partitions and corresponding datasets are scanned.
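As a sketch, a table partitioned by a hypothetical country column lets queries that filter on that column scan only the matching partition directory:

```sql
CREATE TABLE orders (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- Only the country=US partition is scanned, not the whole table
SELECT SUM(amount) FROM orders WHERE country = 'US';
```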

17. What are the Hive collection data types?

ARRAY, MAP, and STRUCT are the three Hive collection data types.
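All three can appear in a single table definition; this sketch uses hypothetical column names:

```sql
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city: STRING, zip: STRING>
);
```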

18. Is it possible to run UNIX shell commands in Hive?

Yes, one can run shell commands in Hive by adding a ‘!’ before the command.

19. Is it possible to execute Hive queries from a script file?

Yes, one can do so with the help of the source command. For example – hive> source /path/queryfile.hql;

20. What is a .hiverc file?

It is a file containing a list of commands that are run when the Hive Command Line Interface (CLI) is started.

21. How can you check if a specific partition exists?

Use the following command: SHOW PARTITIONS table_name PARTITION (partitioned_column=’partition_value’)

22. If you had to list all databases that began with the letter ‘c’, how would you do it?

By using the following command: SHOW DATABASES LIKE ‘c*’ (the wildcard character in SHOW … LIKE patterns is ‘*’)

23. Is it possible to delete DBPROPERTY in Hive?

No, there is no way to delete the DBPROPERTY.

24. Which Java class handles the input record encoding into files that store Hive tables?

The ‘org.apache.hadoop.mapred.TextInputFormat’ class.

25. Which Java class handles output record encoding into Hive query files?

The ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ class.

26. When a Hive table partition is pointed to a new directory, what happens to the data?

The data remains in the old directory and needs to be transferred manually. 

27. Do you save space in the HDFS by archiving Hive tables?

No. Archiving Hive tables only reduces the number of files, which makes data easier to manage; it does not compress the data or save HDFS space.

28. How can you stop a partition from being accessed in a query?

Use the ENABLE OFFLINE clause along with the ALTER TABLE command.

29. What is a table generating function on Hive?

A table-generating function (UDTF) transforms a single input row into multiple output rows. For example, the built-in explode() function emits one row for each element of an array.
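The built-in explode() UDTF can be illustrated with a short sketch (the table and column names are hypothetical):

```sql
-- One output row per element of the skills array
SELECT explode(skills) AS skill FROM employees;
```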

30. Can you avoid MapReduce on Hive?

You can make Hive avoid MapReduce to return query results by setting the hive.exec.mode.local.auto property to ‘true’.

31. Can a Cartesian join be created between two Hive tables?

Yes. A Cartesian product can be written as a CROSS JOIN (or a JOIN with no ON condition), but it is expensive: with no join key, MapReduce cannot parallelize the work, so the entire product is computed by a single reducer.

32. What is a view in Hive?

A view is a logical construct that allows search queries to be treated as tables.

33. Can the name of a view be the same as a Hive table name?

No, the name of the view must always be unique in the database.

34. Can we use the LOAD or INSERT command with a view?

No. A view in Hive is read-only, so these commands cannot be used on it.

35. What is indexing in Hive?

Hive indexing is a query optimization technique to reduce the time needed to access a column or a set of columns within a Hive database.
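A sketch of the index syntax, with hypothetical index, table, and column names (note that indexing was removed entirely in Hive 3.0):

```sql
CREATE INDEX idx_amount
ON TABLE orders (amount)
AS 'COMPACT' WITH DEFERRED REBUILD;
```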

36. Are multi-line comments supported by Hive?

No, multi-line comments are not supported by Hive.

37. How can you view the indexes of a Hive table?

By using the following command: SHOW INDEX ON table_name

38. What is the Hive ObjectInspector function?

It helps to analyze the structure of individual columns and rows and provides access to the complex objects that are stored within the database.

39. What is bucketing?

Bucketing is the process of hashing the values in a column into several user-defined buckets which helps avoid over-partitioning.

40. How is bucketing helpful?

Bucketing helps optimize the sampling process and shortens the query response time.
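A sketch of bucketing at table-creation time, with hypothetical names; sampling can then read a single bucket instead of the full table:

```sql
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Read roughly 1/32 of the data by sampling one bucket
SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 32 ON id);
```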

41. Can you specify the name of the table creator in Hive?

Yes, by using the TBLPROPERTIES clause. For example – TBLPROPERTIES (‘creator’= ‘john’)

42. What is Hcatalog?

HCatalog is a table and storage management tool that helps share Hive’s data structures with other external systems in the Hadoop ecosystem.

43. What is UDF in Hive?

UDF stands for user-defined function: a function written as a Java program to provide functionality that is not part of the existing built-in Hive functions.

44. What does /*+ STREAMTABLE(table_name) */ do?

It is a join hint that tells Hive to stream the named table through the reducers rather than buffering it in memory; by default, the last table in the join is the one streamed.
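A sketch of the hint in a join (table and column names are hypothetical):

```sql
SELECT /*+ STREAMTABLE(o) */ o.id, c.name
FROM orders o JOIN customers c ON (o.customer_id = c.id);
```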

45. What are the limitations of Hive?

Hive has the following limitations: 

  • Real-time queries cannot be executed, and there is no support for row-level updates.
  • Hive cannot be used for online transaction processing.

46. Why do you need HCatalog?

HCatalog is needed for sharing data structures with external systems. It offers access to the Hive metastore for reading and writing data in the Hive data warehouse.

47. Name the components of a Hive query processor.

Following are the components of a Hive query processor:

  1. Logical Plan Generation.
  2. Physical Plan Generation.
  3. Execution Engine.
  4. UDFs and UDAFs.
  5. Operators.
  6. Optimizer.
  7. Parser.
  8. Semantic Analyzer.
  9. Type Checking.

48. How does Hive distribute rows into buckets?

Hive uses the formula: hash_function(bucketing_column) modulo num_of_buckets to calculate a row’s bucket number. Here, hash_function depends on the data type of the column. For an integer column it is simply the column value:

hash_function(int_type_column) = value of int_type_column
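The formula can be mirrored in a short sketch; this Python function only illustrates the arithmetic and is not Hive’s actual Java implementation:

```python
def bucket_for(int_value: int, num_buckets: int) -> int:
    """Bucket number for an integer column: for Hive's integer types,
    hash_function(value) is the value itself, so the bucket number
    is simply the value modulo the number of buckets."""
    return int_value % num_buckets

# A row with id 100 in a table with 32 buckets lands in bucket 4 (100 % 32)
print(bucket_for(100, 32))
```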

49. How do ORC format tables help Hive to enhance the performance?

Storing Hive data in the ORC (Optimized Row Columnar) format improves performance: its columnar layout, built-in compression, and lightweight indexes allow Hive to skip irrelevant data, reducing both storage and query time.

50. What are the different components of a Hive architecture?

Following are the five components of a Hive Architecture:

  1. User Interface: It helps the user send queries and other operations to the Hive system. The user interface provides the Hive Web UI, the Hive Command Line, and Hive HDInsight.
  2. Driver: It creates a session handle for the query, and then the query is sent to the compiler for the execution plan.
  3. Metastore: It contains the organized data and information on various warehouse tables and partitions.
  4. Compiler: It creates the execution plan for the queries, performs semantic analysis on the different query blocks, and generates the query expression.
  5. Execution Engine: It implements the execution plans created by the compiler.

Conclusion

The above-listed Hive interview questions and answers cover most of the important topics under Hive, but this is in no way an exhaustive list. If you are interested in making it big in the world of data and evolving as a Future Leader, you may consider our Integrated Program In Business Analytics, a 10-month online program in collaboration with IIM Indore!
