MapReduce is a programming paradigm that scales across thousands of servers in a Hadoop cluster, and it is not tied to a single programming language. As the processing component, MapReduce is the heart of Apache Hadoop. The term “MapReduce” refers to two separate and distinct tasks that Hadoop programs perform. The first is the map task, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key-value pairs).
The reduce task takes the map’s output as its input and combines those data tuples into a smaller set of tuples.
Benefits of MapReduce
- Scalability. Enterprises can process and analyze petabytes of data stored in the Hadoop Distributed File System (HDFS).
- Flexibility. Hadoop makes it easier to work with more sources of data and with different types of data.
- Speed. Hadoop processes data quickly by working in parallel and moving as little data as possible.
- Simplicity. MapReduce programs can be written in several languages, such as Java, C++, and Python.
How Does MapReduce Work?
To understand how MapReduce works, let’s take a simple example: a word counter.
Suppose we have the following words as input.
Input Splits
Input splitting divides the input data into fixed-size pieces, say 16 KB or any size set by the administrator. Each piece is then handed to a map function. In our example, the data is divided into pieces of two words each.
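The splitting step can be sketched in a few lines of Python. This is only an illustration: the function name `split_input`, the sample sentence, and the choice to measure the split size in words rather than bytes are all assumptions, since the original input is not reproduced here. The sample words were picked so that their counts agree with the final output listed at the end of the example.

```python
def split_input(words, split_size=2):
    """Cut a list of words into consecutive pieces of at most split_size words."""
    return [words[i:i + split_size] for i in range(0, len(words), split_size)]


# Hypothetical input (an assumption, not the article's original data).
sample_words = "Hello world Hello Hadoop Hello to world".split()

splits = split_input(sample_words)
print(splits)
# [['Hello', 'world'], ['Hello', 'Hadoop'], ['Hello', 'to'], ['world']]
```

The last piece may be shorter than the others when the input does not divide evenly.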
Mapping
Mapping is the first phase of data processing in a MapReduce program. The mapping function takes each split as input and produces an output from it. In our example, we want to count the number of occurrences of each word, so the mapping produces a list of (word, freq) pairs, as shown in the diagram below.
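As a rough sketch of the mapping phase, the function below turns one split into (word, freq) pairs. Emitting a frequency of 1 for every occurrence is a common choice for word counting and is an assumption here, as are the name `map_words` and the sample splits.

```python
def map_words(split):
    """Emit one (word, 1) pair for every word in a single input split."""
    return [(word, 1) for word in split]


# Hypothetical splits of two words each (not the article's original input).
splits = [["Hello", "world"], ["Hello", "Hadoop"]]

mapped = [map_words(split) for split in splits]
print(mapped)
# [[('Hello', 1), ('world', 1)], [('Hello', 1), ('Hadoop', 1)]]
```

Each split is mapped on its own, without looking at any other split, which is what lets the map work be spread over many machines.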
Shuffling
Shuffling takes the output of the mapping phase and rearranges it so that identical words are grouped together. Take a look at the example below.
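Shuffling can be pictured as grouping pairs by their key. The sketch below is a single-process stand-in for what a cluster does across machines; the name `shuffle`, the sample pairs, and the decision to return keys in sorted order are assumptions made for illustration.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group the values of identical keys together, with keys in sorted order."""
    grouped = defaultdict(list)
    for word, freq in pairs:
        grouped[word].append(freq)
    return dict(sorted(grouped.items()))


# Hypothetical mapper output (not the article's original data).
pairs = [("Hello", 1), ("world", 1), ("Hello", 1), ("Hadoop", 1)]

print(shuffle(pairs))
# {'Hadoop': [1], 'Hello': [1, 1], 'world': [1]}
```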
Reducing
In the reducer, the shuffled output is aggregated and a single frequency is returned for every word. In effect, this step summarizes the complete dataset.
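The reducing step might look like the following sketch, which sums the grouped frequencies for each word. The name `reduce_counts` is hypothetical, and the grouped input is an assumed shuffle result chosen to agree with the final counts of the example.

```python
def reduce_counts(word, freqs):
    """Collapse all frequencies recorded for one word into a single total."""
    return (word, sum(freqs))


# Assumed shuffle output, consistent with the example's final counts.
grouped = {"Hadoop": [1], "Hello": [1, 1, 1], "to": [1], "world": [1, 1]}

result = [reduce_counts(word, freqs) for word, freqs in grouped.items()]
print(result)
# [('Hadoop', 1), ('Hello', 3), ('to', 1), ('world', 2)]
```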
The final output of the program is:
Word | Frequency
Hello | 3
to | 1
world | 2
Hadoop | 1
In short, the map task covers splitting and mapping, while the reduce task covers shuffling and reducing.
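To tie the four phases to the two tasks, here is a minimal end-to-end sketch that runs on a single machine. It is not how Hadoop itself is implemented: the thread pool only stands in for the nodes of a cluster, and the names `map_phase`, `reduce_phase`, and `word_count`, as well as the sample sentence, are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(split):
    """Mapping: emit (word, 1) for each word in one split."""
    return [(word, 1) for word in split]


def map_phase(words, split_size=2):
    """Map task: split the input, then map every split in parallel."""
    splits = [words[i:i + split_size] for i in range(0, len(words), split_size)]
    with ThreadPoolExecutor() as pool:  # threads stand in for cluster nodes
        return list(pool.map(map_words, splits))


def reduce_phase(mapped_outputs):
    """Reduce task: shuffle pairs by word, then sum each group."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, freq in pairs:
            grouped[word].append(freq)
    return {word: sum(freqs) for word, freqs in grouped.items()}


def word_count(words, split_size=2):
    """Run the map task followed by the reduce task."""
    return reduce_phase(map_phase(words, split_size))


# Hypothetical input chosen to agree with the counts in the example.
counts = word_count("Hello world Hello Hadoop Hello to world".split())
print(counts)
# {'Hello': 3, 'world': 2, 'Hadoop': 1, 'to': 1}
```

Because `pool.map` returns results in the order of the splits, the outcome is the same as running the map functions one after another.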