Apache Hive is one of the best-known data warehouse systems for structured data. It is built on top of Apache Hadoop and was originally developed at Facebook. Hive can read, write, and manage large datasets that reside in distributed storage. It runs SQL-like queries written in HiveQL (HQL), which are internally converted to MapReduce jobs.
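To make the conversion of a query into MapReduce work more concrete, here is a minimal Python sketch of how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be evaluated as a map phase, a shuffle, and a reduce phase. The table name, column names, and sample rows are assumptions made for illustration; the code only imitates the idea in plain Python and is not the job that Hive actually generates.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for an "employees" table.
EMPLOYEES = [
    {"name": "Asha", "dept": "sales"},
    {"name": "Ben", "dept": "engineering"},
    {"name": "Chen", "dept": "sales"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Chain the three phases for the example query."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(EMPLOYEES))
```

In a real cluster the map and reduce phases run in parallel on many nodes, which is why this style of execution scales to large datasets.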
Apache Hive has the following features:
- Hive is fast and scalable.
- It offers SQL-like queries that are implicitly transformed into MapReduce or Spark jobs.
- It can analyse large datasets stored in HDFS.
- It supports different storage types such as plain text, RCFile, and HBase.
- It uses indexing to accelerate queries.
- It can operate on compressed data stored in the Hadoop ecosystem.
- It supports user-defined functions (UDFs), which let users add their own functionality.
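Native Hive UDFs are normally written in Java. As a lighter illustration of user-supplied logic, the sketch below follows the shape of a script that could be used with HiveQL's TRANSFORM clause, where rows are streamed to an external script as tab-separated lines and read back in the same form. The two-column layout (id, name) and the function names are assumptions made for illustration.

```python
import sys


def transform_lines(lines):
    """Upper-case the second column of each tab-separated input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows in this sketch
        fields[1] = fields[1].upper()
        yield "\t".join(fields)


def main():
    """Read rows from standard input and write the results to standard output."""
    for out in transform_lines(sys.stdin):
        print(out)
```

A script used this way would call main() at the bottom of the file. Keeping the row logic in transform_lines makes it easy to try out on a plain list of strings first.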
Limitations:
- Hive is not capable of handling real-time data.
- It is not designed for online transaction processing (OLTP).
- Hive queries have high latency.
Apache Hive architecture:
Apache Hive client:
Hive supports applications written in many languages, including Java, Python, and C++. It supports several types of client, such as:
- Thrift Server- It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
- JDBC Driver- It is used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
- ODBC Driver- It allows applications that support the ODBC protocol to connect to Hive.
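As a small illustration of how a JDBC client addresses Hive, the sketch below builds the connection URL in Python. The driver class named above belongs to the original Hive server and is paired with the jdbc:hive:// scheme, while HiveServer2 uses org.apache.hive.jdbc.HiveDriver with jdbc:hive2://. The host, port, and database values are placeholders, and the helper function is hypothetical.

```python
# Driver class names paired with their JDBC URL schemes.
DRIVERS = {
    "hive": "org.apache.hadoop.hive.jdbc.HiveDriver",  # original Hive server
    "hive2": "org.apache.hive.jdbc.HiveDriver",        # HiveServer2
}


def jdbc_url(host, port=10000, database="default", scheme="hive2"):
    """Build a Hive JDBC URL; every argument value here is a placeholder."""
    if scheme not in DRIVERS:
        raise ValueError(f"unknown scheme: {scheme}")
    return f"jdbc:{scheme}://{host}:{port}/{database}"
```

For example, jdbc_url("localhost") gives jdbc:hive2://localhost:10000/default, which a Java application would pass to its JDBC connection call together with the matching driver class.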
Hive Services:
Hive provides the following services:
- Hive CLI- The Hive CLI (Command Line Interface) is a shell in which Hive queries and commands can be executed.
- Hive Web User Interface- The Hive Web UI is an alternative to the Hive CLI. It offers a web-based graphical user interface for executing Hive queries and commands.
- Hive Metastore- It is a central repository that stores the structure information of the tables and partitions in the warehouse. It also holds column metadata and type information, the serialisers and deserialisers used to read and write data, and the corresponding HDFS files where the data is stored.
- Hive Server- Also referred to as the Apache Thrift Server, it accepts requests from different clients and passes them to the Hive Driver.
- Hive Driver- It receives queries from different sources such as the web UI, the CLI, and the JDBC driver, and passes them to the compiler.
- Hive Compiler- It parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
- Hive Execution Engine- The optimiser generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes these tasks in the order of their dependencies.
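Because the plan is a DAG, a task can only run once the tasks it depends on have finished. The following Python sketch orders a small, made-up plan with Kahn's topological sort. The stage names and dependencies are hypothetical and do not reflect the output of Hive's real planner.

```python
from collections import deque

# Hypothetical plan: each stage maps to the stages it depends on.
PLAN = {
    "scan_orders": [],
    "scan_customers": [],
    "join": ["scan_orders", "scan_customers"],
    "aggregate": ["join"],
    "write_hdfs": ["aggregate"],
}


def execution_order(plan):
    """Return the stages in an order that respects every dependency."""
    # Number of unfinished dependencies per stage.
    remaining = {stage: len(deps) for stage, deps in plan.items()}
    # Reverse edges: which stages are waiting on a given stage.
    dependents = {stage: [] for stage in plan}
    for stage, deps in plan.items():
        for dep in deps:
            dependents[dep].append(stage)

    ready = deque(stage for stage, count in remaining.items() if count == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in dependents[stage]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(plan):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return order
```

Stages that become ready at the same time, such as the two scans here, have no dependency on each other, so an engine is free to run them in parallel.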
Hive data types
Hive data types are grouped into several categories, including numeric types, date/time types, miscellaneous types, and complex types.
Integer types:
- TINYINT- A 1-byte signed integer that ranges from -128 to 127.
- SMALLINT- A 2-byte signed integer that ranges from -32,768 to 32,767.
- INT- A 4-byte signed integer that ranges from -2,147,483,648 to 2,147,483,647.
Decimal types:
- FLOAT- A 4-byte single-precision floating-point number.
- DOUBLE- An 8-byte double-precision floating-point number.
Date/Time types:
- TIMESTAMP- It supports the traditional UNIX timestamp with optional nanosecond precision. An integer value is interpreted as a UNIX timestamp in seconds, and a floating-point value is interpreted as a UNIX timestamp in seconds with decimal precision.
Hive defines many other data types as well, such as column types and null types.
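The integer ranges and the timestamp interpretation can be checked with a short Python sketch. The ranges follow from the byte width of a two's-complement signed integer. The helper names are made up for illustration, and treating the timestamp as UTC is an assumption. Note that Python's datetime keeps microseconds, so it cannot show full nanosecond precision.

```python
from datetime import datetime, timezone


def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of this width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Byte widths of the integer types described in the text.
INTEGER_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4}


def from_unix_seconds(seconds):
    """Interpret an int or float as seconds since the UNIX epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    for name, width in INTEGER_WIDTHS.items():
        print(name, signed_range(width))
    print(from_unix_seconds(0))    # integer seconds
    print(from_unix_seconds(1.5))  # fractional seconds
```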
Questions
1. What is Apache Hive?
2. Explain the architecture of Apache Hive.