Do you have a Hadoop interview scheduled soon? Most big data interviews expect you to have a grasp of popular frameworks and tools. Common topics include Hive, MapReduce, and Flume. Since Hive is a fairly common framework, employers usually ask a number of questions about it.
This article walks you through fifteen of the most commonly asked questions about Hive. Practice with these to refresh your knowledge of the major Hive concepts. Whether you are an experienced candidate or a fresher, this guide should help you prepare.
Top 15 Questions and Answers for a Hive Interview
Read on for the top 15 questions and answers for your next Hive interview.
1. Describe Apache Hive
Ans. Apache Hive is a data warehousing framework built on Hadoop. It facilitates querying and managing large datasets that are stored in distributed storage. It is mainly used to analyze structured or semi-structured data, and it projects structure onto that data. Queries are written in HQL (Hive Query Language), which resembles SQL, and the Hive compiler converts them into MapReduce jobs. Hive supports applications written in Python, Java, C++, PHP, and Ruby.
2. What is the difference between a local and remote metastore?
Ans. Hive stores its metadata in the metastore, which is backed by an RDBMS and accessed through an open-source ORM layer (DataNucleus).
A local metastore service runs in the same JVM as the Hive service. It connects to a database running in a separate JVM, either on the same machine or on a remote machine.
A remote metastore service runs in its own JVM, separate from the Hive service. Running multiple metastore servers improves availability.
3. Describe the components of a Hive Architecture.
Ans. Hive architecture has five major components:
- User Interface: Lets the user submit queries and other operations to the Hive system. The supported interfaces are the Hive web UI, the Hive command line, and Hive HD Insight.
- Driver: Receives the queries, creates a session handle for them, and sends them to the compiler, which produces an execution plan.
- Metastore: Holds the metadata in a structured manner, including table definitions and the partitions or attributes of the warehouse. When it receives a metadata request, it sends the metadata to the compiler.
- Compiler: Parses the queries, performs semantic analysis on the different query blocks and query expressions, and creates an execution plan.
- Execution Engine: Executes the plan created by the compiler and manages the dependencies between the stages of the plan.
4. What is a Hive Partition and what importance does it hold?
Ans. Hive organizes data into tables. A table can be divided into partitions based on the values of one or more partition keys (columns), so that rows with the same key values are stored together. Each partition is a sub-directory of the table's directory. A table can have more than one partition key.
Partitioning lets you achieve granularity in your Hive table, which helps lower query latency. This is because only the relevant partitioned data is scanned instead of the entire dataset.
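To make the sub-directory idea concrete, here is a minimal Python sketch of how partition key values map to directory paths and how a filter on a partition key narrows the set of directories to scan. It is a toy model, not Hive code: the `sales` table, the `country` and `dt` keys, and the helper names `partition_path` and `prune` are all hypothetical.

```python
# Toy model of Hive-style partition directories.
# The table name, partition keys, and helper names are hypothetical.

def partition_path(table, spec):
    """Build a Hive-style path such as sales/country=US/dt=2024-01-01."""
    parts = [f"{key}={value}" for key, value in spec]
    return "/".join([table] + parts)


def prune(partitions, **wanted):
    """Keep only the partitions whose key values match the filter."""
    return [
        spec for spec in partitions
        if all(dict(spec).get(key) == value for key, value in wanted.items())
    ]


# Each partition is an ordered list of (partition key, value) pairs.
partitions = [
    [("country", "US"), ("dt", "2024-01-01")],
    [("country", "US"), ("dt", "2024-01-02")],
    [("country", "IN"), ("dt", "2024-01-01")],
]

# A filter on country selects two of the three directories.
for spec in prune(partitions, country="US"):
    print(partition_path("sales", spec))
```

The fewer directories that survive the filter, the less data a query has to read, which is where the latency benefit comes from.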
5. Describe a Hive variable
Ans. A Hive variable is created in the Hive environment and can be referenced from Hive scripts. Its value is substituted into a query when the query starts executing. Variables can be defined with the SET command or passed on the command line, and a script that references them can be run with the source command.
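The substitution step can be illustrated with a small Python sketch. It mimics how a reference written as `${hivevar:name}` is replaced by its value before the query runs. This is only a simulation under stated assumptions: the `substitute` function, the `sales` table, and the `min_amount` variable are hypothetical, and Hive's own substitution logic handles more cases than this.

```python
import re

# Matches references of the form ${hivevar:name}.
HIVEVAR_REF = re.compile(r"\$\{hivevar:(\w+)\}")


def substitute(query, variables):
    """Replace each ${hivevar:name} reference with the variable's value."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return str(variables[name])
    return HIVEVAR_REF.sub(lookup, query)


# Hypothetical query and variable, for illustration only.
query = "SELECT * FROM sales WHERE amount > ${hivevar:min_amount}"
resolved = substitute(query, {"min_amount": 100})
print(resolved)
```

Because the value is pasted into the query text, the same script can be reused with different values without editing the query itself.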
6. What is dynamic partitioning in Hive and what are its applications?
Ans. In dynamic partitioning, the values of the partition columns are not specified in advance. Hive determines them at runtime from the data as it is loaded into the table. It is typically used when loading data from a non-partitioned table into a partitioned one, where the resulting partitions reduce query latency and make sampling easier. It is also the practical choice when the partition values are not known beforehand, since working them out and creating each partition by hand is tedious.
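As a rough illustration, the Python sketch below routes rows from a non-partitioned staging dataset into partitions chosen by each row's own column value, which is the essence of dynamic partitioning. It is a toy model rather than Hive's implementation: the `dynamic_partition` function, the `staging` rows, and the `country` column are hypothetical.

```python
from collections import defaultdict


def dynamic_partition(rows, partition_column):
    """Route each row to a partition chosen by the row's own column value."""
    partitions = defaultdict(list)
    for row in rows:
        value = row[partition_column]
        # In Hive, the partition column's value lives in the directory
        # name rather than in the data files, so it is dropped here.
        data = {k: v for k, v in row.items() if k != partition_column}
        partitions[f"{partition_column}={value}"].append(data)
    return dict(partitions)


# Hypothetical non-partitioned staging data.
staging = [
    {"id": 1, "amount": 20.0, "country": "US"},
    {"id": 2, "amount": 35.5, "country": "IN"},
    {"id": 3, "amount": 12.0, "country": "US"},
]

result = dynamic_partition(staging, "country")
for name, data in sorted(result.items()):
    print(name, data)
```

Nobody had to list the partition values up front; they emerge from the data during the load.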
7. What is the difference between Hive and HBase?
Ans. Hive and HBase are both based on Hadoop, but they serve different purposes. Hive is a data warehouse infrastructure built on Hadoop, while HBase is a NoSQL database. Hive queries are executed as MapReduce jobs and have high latency on huge datasets. HBase runs on top of HDFS and supports real-time operations with low latency.
8. Describe the default database with Hive Metastore support.
Ans. By default, the Hive Metastore uses an embedded Derby database instance. This setup is also known as the embedded Metastore configuration.
9. Differentiate between external and managed tables in Hive
Ans. When you drop a managed table, Hive deletes both the metadata and the table data from the Hive warehouse directory. When you drop an external table, Hive deletes only the metadata associated with the table. The table data is safely retained in HDFS.
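The difference can be sketched with a toy Python model, assuming the deletion is triggered by dropping the table. The `ToyWarehouse` class stands in for the metastore plus the file system. It is hypothetical and not a Hive API, as are the table names and paths.

```python
class ToyWarehouse:
    """Hypothetical stand-in for the metastore plus HDFS; not a Hive API."""

    def __init__(self):
        self.metadata = {}  # table name -> {"external": bool, "location": str}
        self.files = {}     # location -> stored rows

    def create_table(self, name, location, rows, external=False):
        self.metadata[name] = {"external": external, "location": location}
        self.files[location] = rows

    def drop_table(self, name):
        # The metadata entry is removed for both kinds of table.
        info = self.metadata.pop(name)
        # Only a managed table loses its data as well.
        if not info["external"]:
            del self.files[info["location"]]


wh = ToyWarehouse()
wh.create_table("managed_sales", "/warehouse/managed_sales", [1, 2])
wh.create_table("external_sales", "/data/external_sales", [3, 4], external=True)

wh.drop_table("managed_sales")
wh.drop_table("external_sales")

# Only the external table's data is still present.
print(sorted(wh.files))
```

This is why external tables are the usual choice when the underlying files are shared with other tools.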
10. What is a Hive Index?
Ans. A Hive index is a query optimization technique. It speeds up access to a specific column or set of columns in a Hive table. With an index, a query does not need to read every row of the table to fetch the matching data. Note that index support was removed in Hive 3.0, so this applies to earlier versions.
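The general idea behind an index can be shown with a short Python sketch: build a lookup from column values to row positions once, then use it to fetch matching rows without scanning the whole table. This illustrates the concept only and does not reflect how Hive stores its indexes. The `build_index` and `lookup` helpers and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return dict(index)


def lookup(rows, index, value):
    """Fetch matching rows through the index instead of a full scan."""
    return [rows[position] for position in index.get(value, [])]


# Hypothetical table contents.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]

country_index = build_index(rows, "country")
print(lookup(rows, country_index, "US"))
```

The index costs extra storage and must be kept in sync with the table, which is the usual trade-off for faster reads.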
11. Describe the types of JOIN in Hive
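Ans. The types of JOIN are:
- JOIN- Similar to the inner join in SQL; returns only the rows that satisfy the JOIN condition in both tables
- LEFT OUTER JOIN- Returns all rows from the left table, along with the matching rows from the right table
- RIGHT OUTER JOIN- Returns all rows from the right table, along with the matching rows from the left table
- FULL OUTER JOIN- Returns all rows from both tables, combining left and right records where the JOIN condition is fulfilled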
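The behavior of the four JOIN types can be compared with a small Python sketch that joins two lists of records on a shared key. It is a simplified model, not Hive's join implementation: the `join` function and the `customers` and `orders` data are hypothetical, and columns that Hive would fill with NULL are simply absent from the unmatched rows here.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is inner, left, right, or full."""
    result = []
    matched_right = set()
    for l_row in left:
        matches = [i for i, r_row in enumerate(right) if r_row[key] == l_row[key]]
        for i in matches:
            matched_right.add(i)
            result.append({**right[i], **l_row})
        # Outer joins keep left rows that found no partner.
        if not matches and how in ("left", "full"):
            result.append(dict(l_row))
    # Right and full outer joins keep right rows that found no partner.
    if how in ("right", "full"):
        for i, r_row in enumerate(right):
            if i not in matched_right:
                result.append(dict(r_row))
    return result


# Hypothetical sample data: only id 2 appears in both tables.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 2, "total": 50}, {"id": 3, "total": 75}]

for how in ("inner", "left", "right", "full"):
    print(how, len(join(customers, orders, "id", how)))
```

With this data the inner join keeps only the shared id, each one-sided outer join adds the unmatched row from its own side, and the full outer join keeps all three ids.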
12. What is the purpose of HCatalog?
Ans. HCatalog makes it easy to share data structures with external systems. It also gives those systems access to the Hive metastore, which facilitates direct data transfer to the Hive data warehouse.
13. What are the main advantages of using Hive?
Ans. The main advantages of Hive are:
- Allows for data summarization, query, and analysis
- Can process data without storing it in HDFS
- Supports external tables
- Fits Hadoop's low-level interface requirements
14. What are the limitations of Hive?
Ans. Some limitations faced by Hive are:
- Cannot perform real-time queries
- Lack of row-level updates
- Not suited to online transaction processing (OLTP)
15. Can the same metastore be used by many users in the case of embedded Hive?
Ans. No, an embedded metastore cannot be used in sharing mode. The alternative is to back the metastore with a standalone database such as MySQL or PostgreSQL.
These are just some of the most frequently asked questions that might crop up in a Hive interview. Working through them should speed up your preparation and give you a quick refresher. Wish you the best of luck!