This post is a reading note of the MapReduce paper and other online materials.
- Master Node: Master node is the coordinator of the system. It is responsible for assigning tasks to workers. If the master node fails, the whole process will be aborted.
- Map Worker: Workers that are responsible for doing map tasks. It saves the outputs on the local disks.
- Reduce Worker: Workers that are responsible for doing reduce tasks. Reduce workers read inputs from map workers and it receives the file location from the master node. Note that before applying the reduce operation, reduce workers need to sort the inputs by key.
There is a hidden component here. The MapReduce system can work because it's built on a global file system. The GFS is very important because it can help to achieve data locality which in turn saves network bandwidth. Distributed file systems usually have multiple replicas for files and this increases the probability that workers already have the needed input file stored on the local disk.
This is highlighted in the Apache MapReduce document:
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
The workflow consists of the following steps
- Split inputs
- Create and assign map tasks and reduce tasks.
- Execute map tasks and the outputs are saved in intermediate files on the local disk of map worker machines. Note that the outputs are partitioned.
- Partitioned outputs are read by the reduce workers. Each reduce worker handles a specific partition determined by the partitioning function. The process of fetching the intermediate data is called shuffle.
- Reduce workers sort the received records by key.
- Reduce workers apply the reduce function and write the results to an output file.
The above diagram is a little bit "misleading": the input and output files seem to be outside the cluster. In reality they can be stored on the machines in the cluster.)
----- END -----
©2019 - 2021 all rights reserved