Hadoop & MapReduce Interview Questions & Answers

1: What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that make it possible to use a network of many computers to solve problems involving vast quantities of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

2: What is MapReduce?

Often referred to as the core of Hadoop, MapReduce is a programming framework for processing massive data sets across thousands of servers. The idea is the same as in other cluster scale-out data processing systems. The term MapReduce refers to the two essential tasks a Hadoop program performs: map and reduce.

3: Give a simple example of how MapReduce works.

Let’s take a simple example to understand how MapReduce works. In real-world tasks and applications, however, things are more complicated, because the data handled with Hadoop and MapReduce is vast.

Imagine you have five files, and each file has two columns (a key and a value): a city name and its recorded humidity. Here, the name of the city is the key, and the humidity is the value. One file might look like this:

New York, 22

Houston, 15

Chicago, 30

Washington, 25

Houston, 16

New York, 28

Chicago, 12

Now, we need to find the highest humidity for each city across these five files. The MapReduce framework splits the work into five map tasks; each map task processes one of the five files and returns the maximum humidity for each city in that file. For the file above, the result is:

(New York, 28)(Houston, 16)(Chicago, 30)(Washington, 25)

Likewise, each mapper does the same for the other four files and delivers intermediate results such as the following.

(New York, 32)(Houston, 2)(Chicago, 8)(Washington, 27)

(New York, 29)(Houston, 19)(Chicago, 28)(Washington, 12)

(New York, 18)(Houston, 24)(Chicago, 36)(Washington, 10)

(New York, 30)(Houston, 11)(Chicago, 12)(Washington, 5)

These intermediate results are then passed to the reduce task, which merges the input from all files and outputs a single value per city. The final results here would be:

(New York, 32)(Houston, 24)(Chicago, 36)(Washington, 27)

Because the map tasks run in parallel, these computations finish quickly and scale efficiently to large datasets.
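The example above can be sketched in plain Python. This is only an in-memory illustration, not Hadoop code: the function names map_task and reduce_task are hypothetical, and the per-file maximum follows the description in this answer.

```python
from collections import defaultdict

def map_task(records):
    """Return the maximum humidity per city within a single input file."""
    local_max = {}
    for city, humidity in records:
        local_max[city] = max(humidity, local_max.get(city, humidity))
    return sorted(local_max.items())

def reduce_task(mapper_outputs):
    """Merge all intermediate outputs and keep one maximum per city."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for city, humidity in output:
            grouped[city].append(humidity)
    return {city: max(values) for city, values in grouped.items()}

# Raw records of the sample file shown in the example.
first_file = [
    ("New York", 22), ("Houston", 15), ("Chicago", 30), ("Washington", 25),
    ("Houston", 16), ("New York", 28), ("Chicago", 12),
]

# Intermediate outputs of the remaining mappers, copied from the example.
other_outputs = [
    [("New York", 32), ("Houston", 2), ("Chicago", 8), ("Washington", 27)],
    [("New York", 29), ("Houston", 19), ("Chicago", 28), ("Washington", 12)],
    [("New York", 18), ("Houston", 24), ("Chicago", 36), ("Washington", 10)],
    [("New York", 30), ("Houston", 11), ("Chicago", 12), ("Washington", 5)],
]

final = reduce_task([map_task(first_file)] + other_outputs)
print(final)
```

In a real cluster each map_task call would run on a different node, and only the small intermediate outputs would travel over the network to the reducer.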

4: What are the main components of a MapReduce job?

Main Driver Class: provides the job configuration parameters.
Mapper Class: extends the org.apache.hadoop.mapreduce.Mapper class and implements the map() method.
Reducer Class: extends the org.apache.hadoop.mapreduce.Reducer class and implements the reduce() method.

5: What are Shuffling and Sorting in MapReduce?

Shuffling and Sorting are two major processes that run together between the Mapper and the Reducer.

Shuffling is the process of moving data from the Mappers to the Reducers. It is mandatory for the reducers to proceed, because the shuffled output serves as the input to the reduce tasks. Sorting orders the intermediate key-value pairs by key, so that each reducer receives all the values for a given key together.
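A rough way to picture the two steps is the following Python sketch. It runs in a single process and the function name shuffle_and_sort is hypothetical; in Hadoop the same work is done by the framework across the network.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """Gather pairs from every mapper, sort them by key, and group the values."""
    # Shuffle: collect the intermediate pairs produced by all mappers.
    all_pairs = [pair for output in mapper_outputs for pair in output]
    # Sort: order the pairs by key so equal keys become adjacent.
    all_pairs.sort(key=itemgetter(0))
    # Group: hand each key to the reducer together with all of its values.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]

mapper_outputs = [
    [("Houston", 16), ("Chicago", 30)],
    [("Chicago", 8), ("Houston", 2)],
]
grouped = shuffle_and_sort(mapper_outputs)
print(grouped)
```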

6: What is a Partitioner and what is it used for?

The partitioner is another crucial phase. It controls the partitioning of the intermediate map output keys, by default using a hash function. The partitioning determines which reducer each key-value pair is sent to. The number of partitions is equal to the total number of reduce tasks for the job.
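Hash partitioning can be sketched as follows. The formula mirrors the one used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. As an assumption for illustration, the hash here imitates Java's String.hashCode; a real job's key type (for example Text) may hash differently, and the function names are hypothetical.

```python
def java_string_hash(key: str) -> int:
    """Imitate Java's String.hashCode for characters in the Basic Multilingual Plane."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like a Java int
    # Reinterpret the 32-bit value as a signed integer.
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer index for a key; the mask clears the sign bit."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

for city in ["New York", "Houston", "Chicago", "Washington"]:
    print(city, "-> reducer", partition(city, 3))
```

Because the result depends only on the key, every pair with the same key lands on the same reducer, whichever mapper emitted it.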

7: What is Identity Mapper?

Identity Mapper is the default Mapper class supplied by Hadoop. When no other Mapper class is specified, the Identity Mapper is run. It writes the input data to the output unchanged and performs no computation on it.

8: What is Chain Mapper?

Chain Mapper runs a set of Mapper classes as a chain within a single map task. The first mapper’s output becomes the input of the second mapper, the second mapper’s output becomes the input of the third, and so on until the last mapper.
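The chaining idea, together with an identity mapper, can be illustrated with plain Python generators. This is a conceptual sketch only; the mapper functions and the chain helper are hypothetical names, not the Hadoop ChainMapper API.

```python
def identity_mapper(key, value):
    """Pass the input pair through unchanged."""
    yield key, value

def strip_mapper(key, value):
    """Remove surrounding whitespace from the value."""
    yield key, value.strip()

def upper_mapper(key, value):
    """Convert the value to upper case."""
    yield key, value.upper()

def chain(*mappers):
    """Build one mapper that feeds each mapper's output into the next."""
    def chained(key, value):
        pairs = [(key, value)]
        for mapper in mappers:
            pairs = [out for k, v in pairs for out in mapper(k, v)]
        return pairs
    return chained

pipeline = chain(identity_mapper, strip_mapper, upper_mapper)
print(pipeline(0, "  houston "))
```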

9: What is NameNode in Hadoop?

The NameNode is the node where Hadoop stores all the file location information for HDFS. In other words, the NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks where the file data is kept across the cluster.

10: What is a heartbeat in HDFS?

A heartbeat is a signal sent from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving the signal, it assumes there is a problem with the DataNode or TaskTracker.
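The detection logic can be sketched as a small timeout monitor. This is a simplified model, not Hadoop's implementation: the HeartbeatMonitor class, the node names, and the 10-second timeout are all assumptions chosen for illustration.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per node and report nodes that went silent."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node, now):
        """Remember when the node last reported in."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("datanode-1", now=0)
monitor.record_heartbeat("datanode-2", now=0)
monitor.record_heartbeat("datanode-1", now=8)  # datanode-2 stays silent
print(monitor.dead_nodes(now=12))
```

Timestamps are passed in explicitly so the example is deterministic; a real monitor would read a clock.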

11: What are combiners?

Combiners are used to increase the efficiency of a MapReduce job. They reduce the amount of data that has to be transferred to the reducers by aggregating the map output locally first.

12: When is a combiner used in a MapReduce job?

When the operation performed is commutative and associative, the reducer code can be reused as a combiner. Hadoop does not guarantee that the combiner will be executed, so the job must produce correct results without it.
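A quick way to see why the operation must be commutative and associative is to compare a maximum with an average. The following is a small Python sketch with made-up values; none of the names come from Hadoop.

```python
def mean(values):
    return sum(values) / len(values)

# Values for one key as emitted by two different mappers.
mapper_a = [22, 28]
mapper_b = [32]

# max is commutative and associative, so pre-aggregating per mapper is safe.
max_without_combiner = max(mapper_a + mapper_b)
max_with_combiner = max([max(mapper_a), max(mapper_b)])

# The mean is not: averaging per-mapper averages changes the result.
true_mean = mean(mapper_a + mapper_b)
mean_of_means = mean([mean(mapper_a), mean(mapper_b)])

print(max_without_combiner, max_with_combiner)
print(true_mean, mean_of_means)
```

For an average, a common workaround is to have the combiner emit (sum, count) pairs and divide only in the reducer.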

13: What happens when a data node fails?

When a data node fails, the JobTracker and the NameNode detect the failure. All tasks that were running on the failed node are rescheduled, and the NameNode replicates the user data to another node.

14: What is Speculative Execution?

Speculative execution is when Hadoop launches duplicate copies of the same task. Multiple copies of the same map or reduce task can run on different slave nodes. Put simply, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate task on another node. The copy that finishes first is kept, and the copies that do not finish first are killed.
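The "first finisher wins" rule can be modelled with a few lines of Python. This is a deterministic toy model, assuming we already know how long each attempt would take; the function name and node names are hypothetical.

```python
def speculative_execute(attempts):
    """attempts maps a node name to the seconds its copy of the task needs.

    Returns the node whose result is kept and the nodes whose copies are killed.
    """
    winner = min(attempts, key=attempts.get)
    killed = sorted(node for node in attempts if node != winner)
    return winner, killed

# node-a is a straggler, so a duplicate attempt was started on node-b.
winner, killed = speculative_execute({"node-a": 120.0, "node-b": 35.0})
print("kept:", winner, "killed:", killed)
```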

15: What are the basic parameters of a Mapper?

The basic parameters of a Mapper are its input and output key and value types, for example:

LongWritable and Text (input key and value)
Text and IntWritable (output key and value)

16: What is the difference between an Input Split and an HDFS Block?

An Input Split is a logical division of the data. An HDFS Block, on the other hand, is a physical division of the data.
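The difference shows up when a record crosses a boundary: the bytes are cut physically, but each record still has to be read exactly once. The Python sketch below is a simplified model of line-oriented reading over byte ranges, with hypothetical function names and a deliberately tiny split size of 16 bytes. The rule it assumes is that a reader skips a leading partial line (unless it starts at offset 0) and reads any line that begins at or before the end of its range.

```python
def split_ranges(file_len, split_size):
    """Cut a file of file_len bytes into (start, length) ranges."""
    return [
        (start, min(split_size, file_len - start))
        for start in range(0, file_len, split_size)
    ]

def read_split(data, start, length):
    """Return the (offset, line) records that belong to one range."""
    end = start + length
    pos = start
    if start != 0:
        # The first line here belongs to the previous reader, so skip it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records

data = b"New York,22\nHouston,15\nChicago,30\nWashington,25\n"
per_split = [read_split(data, s, n) for s, n in split_ranges(len(data), 16)]
for records in per_split:
    print(records)
```

Every line is produced once even though three of the four lines straddle a 16-byte boundary.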

17: What happens in the text input format?

In the text input format, every line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. In terms of types, the key is a LongWritable and the value is a Text.
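The key/value layout can be illustrated with a short Python generator. It is a sketch of the idea only; text_input_records is a hypothetical name and the sample bytes are made up.

```python
def text_input_records(data: bytes):
    """Yield (byte offset, line content) for every line in the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its terminator.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

records = list(text_input_records(b"New York,22\nHouston,15\n"))
print(records)
```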

18: What is WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a regular filesystem by exposing HDFS over WebDAV.

19: What is Sqoop in Hadoop?

Sqoop is used to transfer data between an RDBMS and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and exported from HDFS files back to an RDBMS.

20: How does the JobTracker schedule a task?

The TaskTracker sends heartbeat messages to the JobTracker at regular intervals (typically every few seconds) so that the JobTracker knows the TaskTracker is alive and functioning. The message also reports the number of available slots, so the JobTracker stays up to date on where in the cluster work can be assigned.

21: What is SequenceFileInputFormat?

SequenceFileInputFormat is used for reading sequence files. A sequence file is a compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.

22: What does conf.setMapperClass do?

conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.

23: Explain what Hadoop is.

It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive processing power and enormous storage for any kind of data.

24: What are the data components used by Hadoop?

The data components used by Hadoop are

Hive
Pig

25: What is the data storage component used by Hadoop?

HBase is the data storage component used by Hadoop.

26: In Hadoop, what is InputSplit?

InputSplit divides the input files into chunks and assigns each split to a mapper for processing.

27: What is rack awareness?

Rack awareness is the way in which the NameNode decides how to place blocks based on the rack definitions.

28: What is a TaskTracker in Hadoop?

The TaskTracker is the slave node daemon in the cluster that accepts tasks from the JobTracker. It also sends heartbeat messages to the JobTracker at regular intervals (typically every few seconds) to confirm that it is still alive.