This post is a reading note on the paper The Hadoop Distributed File System. Some of the content is copied directly from the paper.
Introduction
Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many hosts, and the execution of application computations in parallel close to their data.
The file system has two types of data:
- Application data: the data that users store in HDFS.
- System metadata: the data that describes the structure of the application data; it is maintained by HDFS.

HDFS stores file system metadata and application data separately. The metadata is stored on the NameNode and the application data on DataNodes. The architecture described in the paper has only a single NameNode per cluster. Because the system metadata is kept entirely in RAM, the size of the namespace (the number of files and directories) is a finite resource.
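This separation can be pictured with a toy model (all class and field names here are illustrative, not HDFS's actual data structures): the NameNode keeps the namespace and the block-to-DataNode mapping in RAM, while DataNodes hold only block contents.

```python
class NameNode:
    """Toy model: all file system metadata lives in RAM on a single node."""
    def __init__(self):
        self.inodes = {}           # path -> list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode IDs

    def create_file(self, path, block_ids):
        self.inodes[path] = list(block_ids)

    def report_block(self, block_id, datanode_id):
        # DataNodes report their blocks; the NameNode learns locations this way.
        self.block_locations.setdefault(block_id, set()).add(datanode_id)


class DataNode:
    """Toy model: stores only application data (block contents)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block ID -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data
```

Note that the DataNode knows nothing about paths or directories; only the NameNode can answer "which blocks make up this file, and where do they live?".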
Main Concepts
- HDFS namespace
    - The HDFS namespace is a hierarchy of files and directories.
    - Files and directories are represented on the NameNode by inodes.
- Blocks of Files
    - A block is the unit of storage; a file consists of many blocks.
    - Each block of a file is independently replicated at multiple DataNodes.
    - The NameNode maintains the mapping between the blocks of files and the DataNodes.
- Image
    - The image consists of the inode data and the list of blocks belonging to each file.
- Checkpoint File
    - This is the persistent record of an image.
    - The checkpoint file is never changed by the NameNode, but it can be replaced as a whole.
- Journal
    - This is a persistent write-ahead log that records changes to the file system.
- Snapshot
    - A snapshot is a backup of the whole file system at a given instant.
    - It is relatively easy to back up the metadata: merge the checkpoint file with the journal and save the result to a new file.
    - We cannot simply copy the user data, because that would double the space consumption. Instead, HDFS uses hard links to the existing block files as the "copy".
- Balancer
    - HDFS's block placement strategy does not take DataNode disk space utilization into account. This avoids placing new data, which is more likely to be referenced, on a small subset of DataNodes. As a result, data is not always placed uniformly across DataNodes; imbalance also occurs when new nodes are added to the cluster. The balancer redistributes replicas to even out disk usage.
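The balancing idea can be sketched as a greedy loop (a simplified model with invented names; the real balancer moves individual block replicas, respects rack placement, and throttles the bandwidth used for moves):

```python
def plan_moves(used, capacity, threshold=0.10, step=1):
    """Plan moves (src, dst, amount) until every node's utilization is
    within `threshold` of the cluster-wide average utilization.

    used, capacity: dicts mapping node ID -> storage units.
    Each move shifts `step` units; real HDFS moves whole block replicas.
    """
    avg = sum(used.values()) / sum(capacity.values())
    used = dict(used)  # work on a copy
    moves = []
    while True:
        over = [n for n in used if used[n] / capacity[n] > avg + threshold]
        under = [n for n in used if used[n] / capacity[n] < avg - threshold]
        if not over or not under:
            break
        # Move from the most over-utilized to the most under-utilized node.
        src = max(over, key=lambda n: used[n] / capacity[n])
        dst = min(under, key=lambda n: used[n] / capacity[n])
        used[src] -= step
        used[dst] += step
        moves.append((src, dst, step))
    return moves
```

With two equal-capacity nodes at 90% and 10% utilization and a 10% threshold, the loop shifts data until both land within the 40%-60% band around the 50% average.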
Components
There are four types of nodes in a Hadoop cluster. They are:
- NameNode
    - The NameNode endeavors to ensure that each block always has the intended number of replicas. It detects that a block has become under- or over-replicated when a block report from a DataNode arrives.
    - It maintains a replication priority queue used to prioritize replication tasks.
    - The NameNode also makes sure that not all replicas of a block are located on one rack.
    - The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk can become a bottleneck, since all other threads must wait until the synchronous flush-and-sync procedure initiated by one of them is complete.
    - To optimize this process, the NameNode batches multiple transactions initiated by different clients into a single flush.
    - The NameNode is the coordinator of the whole system.
- DataNode
    - Application data is stored on DataNodes.
- CheckpointNode
    - The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
    - It is essentially a worker node.
- BackupNode
    - The BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
    - It is essentially a read-only replica of the NameNode.
(Note that the architecture described in the paper only has one single NameNode in the cluster.)
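The NameNode's replication priority queue can be sketched with a min-heap keyed by the number of live replicas, so the blocks in most danger are re-replicated first (hypothetical names; the paper describes the policy, e.g. a block with a single replica has the highest priority, but not the implementation):

```python
import heapq

class ReplicationQueue:
    """Min-heap keyed by live replica count: fewest replicas -> served first."""
    def __init__(self):
        self._heap = []

    def mark(self, block_id, live_replicas, target_replicas):
        # Only under-replicated blocks enter the queue.
        if live_replicas < target_replicas:
            heapq.heappush(self._heap, (live_replicas, block_id))

    def next_block(self):
        # Returns the most at-risk block, or None when the queue is empty.
        return heapq.heappop(self._heap)[1] if self._heap else None
```

A block with one surviving replica is popped before a block with two, and fully replicated blocks never enter the queue at all.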


Workflow
Startup
During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients. If either the checkpoint or the journal is missing, or becomes corrupt, the namespace information will be lost partly or entirely.
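The startup sequence amounts to loading the checkpoint into memory and replaying the journal over it. The `(op, path, value)` record format below is invented for illustration; the real journal stores serialized namespace operations:

```python
def recover_namespace(checkpoint, journal):
    """Rebuild the in-memory namespace from a checkpoint plus journal records.

    checkpoint: dict mapping path -> metadata (snapshot at checkpoint time).
    journal:    ordered list of (op, path, value) records written since then.
    """
    namespace = dict(checkpoint)  # start from the last persisted snapshot
    for op, path, value in journal:
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "update":
            namespace[path] = value
    return namespace
```

Because replay is ordered, the result is exactly the namespace as of the last journaled transaction, which is why losing either file loses namespace state.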
Read/Write Operation
To read or write a file, the client first contacts the NameNode. For a write, the NameNode nominates a list of DataNodes that will host the replicas of each block, and these DataNodes form a pipeline through which the client pushes the data; for a read, the NameNode returns the locations of each block's replicas and the client reads from the closest one.
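The write path can be sketched as data flowing node to node along the pipeline (a toy model with invented names; real DataNodes forward packets over TCP and send acknowledgements back up the chain):

```python
def write_block(block_id, data, pipeline, stores):
    """Push `data` to the first DataNode, which forwards it downstream.

    pipeline: ordered list of DataNode IDs chosen by the NameNode.
    stores:   node ID -> dict mapping block IDs to block contents.
    """
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    stores[head][block_id] = data              # the head stores its replica...
    write_block(block_id, data, rest, stores)  # ...then forwards downstream
```

After the call, every DataNode in the pipeline holds a replica of the block, which is the point of the pipeline: the client sends the data once, not once per replica.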


More about read and write operation
- HDFS implements a single-writer, multiple-readers model.
- A writer must first acquire a lease on the file.
- The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.
- HDFS permits a client to read a file that is open for writing. When reading a file open for writing, the length of the last block still being written is unknown to the NameNode. In this case, the client asks one of the replicas for the latest length before starting to read its content.
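The single-writer, multiple-readers model can be sketched with a minimal lease table (a toy model; the real lease has soft and hard expiration limits and is renewed via heartbeats):

```python
class LeaseManager:
    """Toy model: at most one writer per file; readers never take a lease."""
    def __init__(self):
        self._holders = {}  # path -> client ID of the current writer

    def acquire(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            return False  # another writer already holds the lease
        self._holders[path] = client
        return True

    def release(self, path, client):
        if self._holders.get(path) == client:
            del self._holders[path]
```

A second client's `acquire` fails until the first releases (or, in real HDFS, its lease expires), while reads proceed without consulting the lease table at all.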
----- END -----
©2019 - 2022 all rights reserved