System Design - Distributed Snapshots and Termination Detection for Diffusing Computation
- Distributed Snapshots: Determining Global States of Distributed Systems (local copy)
- Termination Detection for Diffusing Computations (local copy)
A distributed system consists of a finite set of processes and a finite set of channels. It is described by a labeled, directed graph in which the vertices represent processes and the edges represent channels.
Processes in a distributed system communicate by sending and receiving messages. A process can record its own state and the messages it sends and receives; it can record nothing else.
The state of a channel is the sequence of messages sent along the channel, excluding the messages received along the channel. The state of a channel \(c\) may be changed by the sending of a message along \(c\) (if \(c\) is directed away from \(p\)) or the receipt of a message along \(c\) (if \(c\) is directed towards \(p\)).A
An event of a distributed system can be characterized by a 5-tuple: \( \langle p, s, s', M, c \rangle \). It is interpreted as "the process \(p\) sends or receives a message \(M\) through the channel \(c\). The state of the process \(p\) right before the handling of the message is \(s\) and the state of the process \(p\) right after the handling of the message is \(s'\)."
A global state of a distributed system is a set of component process and channel states.
The problem is to devise algorithms by which processes record their own states and the states of communication channels so that the set of process and channel states recorded form a global system state. To limit the problem space, we assume that the delay experienced by a message in a channel is arbitrary but finite.
The challenge is that we cannot ensure that the states of all processes and channels will be recorded at the same instant because there is no global clock; however, we require that the recorded process and channel states form a "meaningful" global system state.
The main idea of the Global-State-Detection Algorithm is to use special markers to partition the messages flowing in the channels of the distributed system. Markers are used as a command to instruct the nodes that receive the marker to record its state and the state of incident channels.
Marker-Sending Rule for a Process \(p\) - For each channel \(c\), incident on, and directed away from \(p\), \(p\) sends one marker along \(c\) after \(p\) records its state and before p sends further messages aong \(c\).
Marker-Receiving Rule for a Process \(q\) - On receiving a marker along a channel \(c\):
#if q has not recorded its state: # q records its state; # q records the state c as the empty sequence #else: # q records the state of c as the sequence of messages # during along c after q's state was recorded and before q received the marker along c.
Rule L1 - No marker remains forever in an incident input channel
Rule L2 - node record its state within finite time of initiation of the algorithm.
How to Collect Global State Records
A simple algorithm for collecting information in a system whose topology is strongly connected is for each process to send the information it records along all outgoing channels, and for each process receiving information for the first time to copy it and propagate it along all of its outgoing channels. All the recorded information will then get to all the processes in finite time, allowing all processes to determine the recorded global state.
In the paper Distributed Snapshots: Determining Global States of Distributed Systems, it mentioned another paper Termination Detection for Diffusing Computations in which Dijkstra and Scholten proposed an elegant algorithm to determine the termination of computation in a distributed system. The original paper is fun to read so we don't provide any notes here other than some key concepts
- the environment
- the gate
- cornet: verify first in, very last out.
- neural node
- engaged node
- engagement tree
- the ultimate state
----- END -----
©2019 - 2022 all rights reserved