HDFS Tutorial: Architecture, Read & Write Operation using Java API
Hadoop Architecture
Hadoop follows a master-slave architecture for distributed data storage and processing. A Hadoop cluster consists of a single master node and multiple slave nodes: the NameNode is the master, and the DataNodes are the slaves.
NameNode:
An HDFS cluster has a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also manages the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
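To make the idea of splitting a file into blocks concrete, here is a minimal sketch in plain Java. It is not Hadoop code: the class name BlockSplitDemo, the method splitIntoBlocks, and the 128 MB block size are assumptions chosen for illustration (the actual block size is a cluster configuration setting).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: computes how a file of a given length would be cut into
// fixed-size blocks. BlockSplitDemo and splitIntoBlocks are hypothetical names.
class BlockSplitDemo {

    // Assumed block size of 128 MB; the real value is a cluster setting.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Returns the length of each block; only the last block may be shorter.
    static List<Long> splitIntoBlocks(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("invalid length or block size");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long len = Math.min(blockSize, remaining);
            blocks.add(len);
            remaining -= len;
        }
        return blocks;
    }

    public static void main(String[] args) {
        long fileLength = 300L * 1024 * 1024; // a 300 MB file
        List<Long> blocks = splitIntoBlocks(fileLength, BLOCK_SIZE);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("block " + i + ": " + blocks.get(i) + " bytes");
        }
    }
}
```

Under these assumptions, a 300 MB file is split into two full 128 MB blocks and one final 44 MB block. Each of those blocks would then be stored on DataNodes, with the NameNode recording which DataNodes hold which block.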
The NameNode and the DataNode are pieces of software designed to run on commodity machines.
DataNode:
These commodity machines typically run a GNU/Linux operating system. HDFS is written in Java, so any machine that supports Java can run the NameNode or the DataNode software. The design does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.
Having a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed so that client data never flows through the NameNode.
Read operation in HDFS
To read data from HDFS, the client first needs to learn from the NameNode where the data is stored. The NameNode returns the information needed to read the data from the DataNodes. It also provides an authentication token, so that only a client holding a valid token can read data from the DataNodes.
Step 1:
The client node needs to interact with the NameNode to find out which DataNodes hold the data. To do so, the client sends an open request to the distributed file system.
Step 2:
The distributed file system contacts the NameNode and gets the authorization token and the locations of the DataNodes.
Step 3:
The client sends a read request to the Java input stream to read data from the DataNodes.
Step 4:
The input stream reads the data from the DataNodes.
Step 5:
After receiving the data, the client sends the close command to the input stream.
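The read steps above can be sketched with a toy in-memory model. This is plain Java, not the Hadoop API: ToyNameNode, ToyDataNode, readFile, and sampleCluster are hypothetical names, each block has a single replica, and the authorization token is left out for brevity. The point it tries to show is that the NameNode only answers the "where are the blocks?" question, while the bytes themselves come straight from the DataNodes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of the read path; none of these classes are Hadoop APIs.
class ReadFlowDemo {

    // Stand-in for a DataNode: stores block bytes keyed by block id.
    static class ToyDataNode {
        final Map<Long, byte[]> blocks = new HashMap<>();

        byte[] readBlock(long blockId) {
            return blocks.get(blockId);
        }
    }

    // Stand-in for the NameNode: holds metadata only, never file contents.
    static class ToyNameNode {
        // file path -> ordered block ids
        final Map<String, List<Long>> fileToBlocks = new HashMap<>();
        // block id -> DataNode holding it (a single replica, for brevity)
        final Map<Long, ToyDataNode> blockToNode = new HashMap<>();

        List<Long> getBlockIds(String path) {
            List<Long> ids = fileToBlocks.get(path);
            if (ids == null) {
                throw new IllegalArgumentException("no such file: " + path);
            }
            return ids;
        }
    }

    // Client side: ask the NameNode for the block list and locations, then
    // pull each block's bytes directly from the DataNode that holds it.
    static byte[] readFile(ToyNameNode nameNode, String path) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long blockId : nameNode.getBlockIds(path)) {
            ToyDataNode node = nameNode.blockToNode.get(blockId);
            byte[] data = node.readBlock(blockId);
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }

    // Builds a two-block file spread over two DataNodes.
    static ToyNameNode sampleCluster() {
        ToyDataNode dn1 = new ToyDataNode();
        ToyDataNode dn2 = new ToyDataNode();
        dn1.blocks.put(1L, "hello ".getBytes(StandardCharsets.UTF_8));
        dn2.blocks.put(2L, "hdfs".getBytes(StandardCharsets.UTF_8));

        ToyNameNode nn = new ToyNameNode();
        List<Long> ids = new ArrayList<>();
        ids.add(1L);
        ids.add(2L);
        nn.fileToBlocks.put("/demo.txt", ids);
        nn.blockToNode.put(1L, dn1);
        nn.blockToNode.put(2L, dn2);
        return nn;
    }

    public static void main(String[] args) {
        byte[] content = readFile(sampleCluster(), "/demo.txt");
        System.out.println(new String(content, StandardCharsets.UTF_8));
    }
}
```

Running main prints "hello hdfs", reassembled from the two blocks in order. Note that ToyNameNode never touches the block bytes, which mirrors the design in which client data does not pass through the NameNode.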
Write operation in HDFS
Writing data is a little more complex than reading data.
Step 1:
The client node needs to interact with the NameNode to find out on which DataNodes the data should be stored. To do so, the client sends a create request to the distributed file system.
Step 2:
The distributed file system contacts the NameNode using a remote procedure call, gets the authorization token, and asks the NameNode to create a new file. The NameNode checks whether the requested file already exists. If all checks pass, the NameNode sends back the locations of the DataNodes to write to.
Step 3:
The client sends a write request to the Java output stream to write data to the DataNodes.
Step 4:
The output stream writes the data to the DataNodes. A pipeline of DataNodes is created to replicate the data. Here the replication factor is 3, so each block is written to three DataNodes.
Step 5:
After writing the data, the client sends the close command to the output stream.
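The replication pipeline used during a write can be illustrated with a small plain-Java model. It is only a sketch under simplifying assumptions: ToyDataNode, buildPipeline, and write are hypothetical names rather than Hadoop classes, everything runs in one process, and failures are not handled. The client hands each packet to the first DataNode only; every node stores the packet and forwards it to the next one, and the acknowledgement count travels back to the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of pipelined replication; these classes are not Hadoop APIs.
class WritePipelineDemo {

    // Stand-in for a DataNode that stores a packet and forwards it downstream.
    static class ToyDataNode {
        final String name;
        final List<byte[]> stored = new ArrayList<>();
        ToyDataNode next; // next node in the pipeline, or null for the last one

        ToyDataNode(String name) {
            this.name = name;
        }

        // Stores the packet locally, forwards it, and returns how many nodes
        // (this one plus those downstream) acknowledged it.
        int receive(byte[] packet) {
            stored.add(packet.clone());
            int downstreamAcks = (next == null) ? 0 : next.receive(packet);
            return 1 + downstreamAcks;
        }
    }

    // Chains the given nodes into a pipeline and returns the head.
    static ToyDataNode buildPipeline(List<ToyDataNode> nodes) {
        for (int i = 0; i < nodes.size() - 1; i++) {
            nodes.get(i).next = nodes.get(i + 1);
        }
        return nodes.get(0);
    }

    // The client sends each packet only to the head of the pipeline and
    // checks that the expected number of replicas acknowledged it.
    static void write(ToyDataNode head, List<byte[]> packets, int replication) {
        for (byte[] packet : packets) {
            int acks = head.receive(packet);
            if (acks != replication) {
                throw new IllegalStateException(
                        "expected " + replication + " acks, got " + acks);
            }
        }
    }

    public static void main(String[] args) {
        List<ToyDataNode> nodes = Arrays.asList(
                new ToyDataNode("dn1"), new ToyDataNode("dn2"), new ToyDataNode("dn3"));
        ToyDataNode head = buildPipeline(nodes);
        write(head, Arrays.asList("ab".getBytes(), "cd".getBytes()), 3);
        for (ToyDataNode n : nodes) {
            System.out.println(n.name + " stored " + n.stored.size() + " packets");
        }
    }
}
```

With three nodes in the pipeline and two packets written, each of dn1, dn2, and dn3 ends up storing both packets, which corresponds to a replication factor of 3.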
Hadoop provides a number of Java classes for working with its file system programmatically. The package org.apache.hadoop.fs contains classes for managing files in Hadoop's file system. The supported operations include open, read, write, and close.
The following example code reads data from HDFS:
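import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inside a method that declares "throws IOException":
Configuration configuration = new Configuration();
FileSystem fileSystem = FileSystem.get(configuration);
Path path_of_file = new Path("/path/to/file.ext");
if (!fileSystem.exists(path_of_file)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path_of_file);
byte[] b = new byte[1024]; // read buffer
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    System.out.write(b, 0, numBytes); // print only the bytes just read
}
in.close();
fileSystem.close();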
The above code opens a file and reads its contents. The path of the file on HDFS is hard-coded in this snippet; it could instead be passed to the program as a command-line argument.