System Design - The Hadoop Distributed File System

Subscribe Send me a message home page tags

#system design  #HDFS  #hadoop  #file system 

This post is a reading note of the paper The Hadoop Distributed File System. Some of the contents are copied directly from the paper.

Related Readings


Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many of hosts, and executing application computations in parallel close to their data.

The file system has two types of data:

HDFS stores file system metadata and application data separately. The metadata is stored in NameNode and the application data is stored in DataNode. The architecture described in the paper only has one single NameNode in the cluster. Because the system metadata are always in RAM, the size of the namespace (number of files and directories) is a finite resource.

Main Concepts


There are four types of nodes in a hadoop cluster. They are:

(Note that the architecture described in the paper only has one single NameNode in the cluster.)




During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients. If either the checkpoint or the journal is missing, or becomes corrupt, the namespace information will be lost partly or entirely.

Read/Write Operation

To read/write a file, the client first connect to the NameNode. The NameNode will provide a list of DataNode that will host the blocks of the file and this forms a pipeline.


More about read and write operation

----- END -----

Welcome to join reddit self-learning community.
Send me a message Subscribe to blog updates

Want some fun stuff?