Apache Pig

Apache Pig is an abstraction over MapReduce. It is a tool, or platform, used to analyze large data sets. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators, using which programmers can develop their own functions for reading, writing, and processing data. To analyze data using Apache Pig, programmers write scripts in the Pig Latin language. These scripts are internally converted into Map and Reduce tasks: Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
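As a minimal sketch of what such a script looks like (the HDFS path, file layout, and field names here are hypothetical, chosen only for illustration):

```pig
-- Load a tab-separated file from HDFS (path and schema are assumed for illustration)
employees = LOAD 'hdfs://localhost:9000/data/employees.txt'
            USING PigStorage('\t')
            AS (id:int, name:chararray, city:chararray);

-- Keep only the rows we are interested in
in_chennai = FILTER employees BY city == 'Chennai';

-- Print the result; Pig converts these statements into MapReduce jobs internally
DUMP in_chennai;
```

When this script is run, the Pig Engine translates the LOAD, FILTER, and DUMP statements into one or more MapReduce jobs that execute on the Hadoop cluster.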

Why do we need Apache Pig?

Programmers who are not very good at Java normally struggle while working with Hadoop, particularly when performing MapReduce tasks. Apache Pig is a boon for such programmers: using Pig Latin, they can perform MapReduce tasks easily without having to type complex Java code.

Apache Pig uses a multi-query approach, thereby reducing the length of the code. An operation that would require us to type 200 lines of code in Java can be done by typing as few as 10 lines of code in Apache Pig. Apache Pig thus reduces development time by almost 16 times.
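The classic word-count job illustrates this conciseness: in Java MapReduce it takes a mapper class, a reducer class, and a driver, while in Pig Latin it can be sketched in a handful of statements (the input and output paths below are assumptions for illustration):

```pig
-- Word count in a few lines of Pig Latin (paths are hypothetical)
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount_out';
```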

Pig Latin is a SQL-like language, so it is easy to learn Apache Pig when you are already familiar with SQL.

Apache Pig provides many built-in operators, such as joins and ordering. It also provides nested data types like tuples and bags, which are missing from MapReduce.
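A hedged sketch of these built-in operators and nested types (the relation names, file paths, and schemas are assumptions):

```pig
-- Load two hypothetical comma-separated data sets
customers = LOAD '/data/customers.txt' USING PigStorage(',')
            AS (cid:int, name:chararray);
orders    = LOAD '/data/orders.txt' USING PigStorage(',')
            AS (oid:int, cid:int, amount:double);

-- Built-in JOIN operator
joined  = JOIN customers BY cid, orders BY cid;

-- GROUP produces a bag of tuples per key, a nested type MapReduce lacks
by_cust = GROUP orders BY cid;
totals  = FOREACH by_cust GENERATE group AS cid, SUM(orders.amount) AS total;

-- Built-in ordering
ranked  = ORDER totals BY total DESC;
DUMP ranked;
```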

Features of Apache Pig:

Apache Pig comes with the following set of features:

  • Rich set of operators – Pig provides many built-in operators to perform operations like join, sort, etc.
  • Ease of programming – Pig Latin is similar to SQL, so it is easy to write a Pig script if you are good at SQL.
  • Optimization opportunities – The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
  • Extensibility – Using the existing operators, users can develop their own functions to read, process, and write data.
  • UDFs – Pig provides the facility to create user-defined functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
  • Handles all kinds of data – Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.
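As a sketch of the UDF facility (the jar name, package, and class below are hypothetical), a user-defined function written in Java and packaged in a jar can be registered and invoked from a Pig script:

```pig
-- Register the jar containing the user-defined function (hypothetical jar name)
REGISTER 'myudfs.jar';

-- Give the Java class a short alias (hypothetical package and class)
DEFINE UPPERCASE com.example.pig.UpperCase();

-- Invoke the UDF like any built-in function
names = LOAD '/data/names.txt' AS (name:chararray);
upper = FOREACH names GENERATE UPPERCASE(name);
DUMP upper;
```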

Applications of Apache Pig:

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used:

  • To process huge data sources such as web logs.
  • To perform data processing for search platforms.
  • To process time-sensitive data loads.
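A hedged sketch of the web-log use case (the log path and field layout are assumptions; real access logs usually need a custom loader or more parsing):

```pig
-- Load raw access logs; assume space-separated fields for illustration
logs   = LOAD '/logs/access.log' USING PigStorage(' ')
         AS (ip:chararray, ts:chararray, url:chararray, status:int);

-- Count requests per URL, a typical ad-hoc log analysis
by_url = GROUP logs BY url;
hits   = FOREACH by_url GENERATE group AS url, COUNT(logs) AS requests;
top    = ORDER hits BY requests DESC;
STORE top INTO '/logs/top_urls';
```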

Apache Pig architecture:

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators to perform various operations on the data. To perform a particular task, programmers write a script in the Pig Latin language and execute it using one of the available execution mechanisms. After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.

The components of Apache Pig are:

Parser

Pig scripts are initially handled by the parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a representation of the Pig Latin statements and logical operators.

Optimizer

The logical plan is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these jobs are executed on Hadoop to produce the desired results.
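The stages above can be inspected with Pig's EXPLAIN command, which prints the plans generated for a relation (the script below and its paths are an illustrative sketch):

```pig
-- Build a small pipeline (input path is hypothetical)
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);

-- Prints the logical plan, the physical plan, and the MapReduce plan
EXPLAIN counts;
```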

Questions

1. What is Apache Pig?

2. Explain the architecture of Apache Pig.
