Apache Hive Tutorial: What is Hive?
Apache Hive is a data warehouse system built on top of Hadoop and used to analyze structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce: it provides a mechanism to project structure onto data and query it with HQL (Hive Query Language), a language similar to SQL. Internally, the Hive compiler translates these HQL queries into MapReduce jobs, so you do not have to write complex MapReduce programs to process your data with Hadoop. It is targeted at users who are comfortable with SQL. Apache Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
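For instance, an analysis that would otherwise need a custom MapReduce program can be written in a few lines of HQL. The sketch below uses hypothetical table and column names purely for illustration:

```sql
-- Hypothetical table of page-view events stored in HDFS
CREATE TABLE page_views (
  user_id   BIGINT,
  page_url  STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Count views per page; Hive compiles this query into MapReduce jobs
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```

The `GROUP BY` aggregation here is exactly the kind of job that would take dozens of lines of hand-written map and reduce code.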
Challenges at Facebook: Exponential Growth of Data
Before 2008, Facebook built its entire data processing infrastructure around a data warehouse based on a commercial RDBMS. That infrastructure was sufficient for Facebook's needs at the time. But as the data began growing very fast, handling and processing this vast dataset became a considerable challenge. According to Facebook, the data grew from a 15 TB dataset in 2007 to 2 PB in 2009. In addition, many Facebook products involve data analysis, such as Audience Insights, Facebook Lexicon, and Facebook Ads. Facebook needed a scalable and cost-effective solution to cope with this problem, and began using the Hadoop framework.
Democratizing Hadoop – MapReduce
But as the data accumulated, the complexity of the MapReduce code grew with it, making it hard for people from a non-programming background to write MapReduce programs. Even a simple analysis could require hundreds of lines of MapReduce code. Since SQL was widely used by engineers and analysts, including at Facebook, putting SQL on top of Hadoop seemed a logical way to make Hadoop accessible to users with a SQL background.
Hence, SQL's ability to satisfy most analytic requirements, combined with the scalability of Hadoop, gave birth to Apache Hive, which enables running SQL-like queries on data residing in HDFS. Facebook open-sourced the Hive project in August 2008, and it is freely available today as Apache Hive.
Now, let us look at the features and benefits of Hive that make it so popular.
Advantages of Hive
- Helpful for people who are not from a programming background, as it reduces the need to write complex MapReduce programs.
- Extensible and scalable to cope with the growing volume and variety of data, without affecting system performance.
- It serves as an effective ETL (Extract, Transform, Load) tool.
- Hive supports client applications written in Java, PHP, Python, C++, or Ruby by exposing its Thrift server.
- Hive stores its metadata in an RDBMS, which significantly reduces the time needed for semantic checks during query execution.
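To illustrate the ETL point above, the sketch below extracts raw data, transforms it, and loads it into a structured target table. All paths and table names are hypothetical:

```sql
-- Extract: expose raw HDFS files as a Hive table (hypothetical path)
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw/logs';

-- Transform + Load: parse each line and write into a clean target table
INSERT OVERWRITE TABLE clean_logs
SELECT split(line, '\t')[0] AS user_id,
       split(line, '\t')[1] AS url
FROM raw_logs;
```

Because both steps run as Hive queries, the transformation scales with the cluster rather than with a single ETL server.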
NASA Case Study
A climate model is a mathematical representation of climate systems based on various factors that affect Earth's climate. It describes the interaction of climate drivers such as the ocean, sun, and atmosphere to give insight into the dynamics of the climate system. It is used to project climate conditions by simulating climate changes based on the factors that influence climate. NASA's Jet Propulsion Laboratory developed the Regional Climate Model Evaluation System (RCMES) to analyze and evaluate climate model output against remote sensing data available in various external repositories.
The RCMES (Regional Climate Model Evaluation System) has two parts:
RCMED (Regional Climate Model Evaluation Database):
It is a scalable cloud database that stores climate-related remote sensing data and reanalysis data, ingested using extractors such as Apache OODT extractors, Apache Tika, etc. It transforms the data into a data point model of the form (latitude, longitude, time, value, height) and stores it in a MySQL database. The client can retrieve the data present in RCMED through space/time queries. The details of such queries are not relevant to us here.
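The data point model described above could be represented by a table along these lines. This is a hypothetical sketch of the (latitude, longitude, time, value, height) shape, not the actual RCMED schema:

```sql
-- Hypothetical table for the (latitude, longitude, time, value, height) data point model
CREATE TABLE data_point (
  latitude  DOUBLE,
  longitude DOUBLE,
  obs_time  TIMESTAMP,
  value     DOUBLE,
  height    DOUBLE
);

-- A space/time query: values inside a bounding box and a time window
SELECT *
FROM data_point
WHERE latitude  BETWEEN 30.0  AND 40.0
  AND longitude BETWEEN -120.0 AND -110.0
  AND obs_time  BETWEEN '2007-01-01' AND '2009-12-31';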
RCMET (Regional Climate Model Evaluation Toolkit):
It enables the user to compare the reference data present in RCMED with the climate model output data fetched from other sources, in order to perform different analyses and evaluations. You can refer to the image below to understand the architecture of RCMES.
Deployment of Hive
The image below shows the deployment of Apache Hive in RCMES. The NASA team followed these steps while deploying Apache Hive:
- They installed Hive using Cloudera and Apache Hadoop, as shown in the image.
- They used Apache Sqoop to ingest data into Hive from the MySQL database.
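A Sqoop import of this kind typically looks like the following sketch; the host, database, and table names are hypothetical:

```shell
# Import a MySQL table directly into a Hive table (names are illustrative)
sqoop import \
  --connect jdbc:mysql://dbhost/rcmed \
  --username rcmes_user -P \
  --table data_point \
  --hive-import \
  --hive-table data_point
```

The `--hive-import` flag tells Sqoop to create the matching Hive table and load the imported data into it, rather than leaving plain files in HDFS.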