System Design - Distributed Snapshots and Termination Detection for Diffusing Computation

Subscribe Send me a message home page tags

#system design  #distributed system  #snapshots  #termination detection  #distributed snapshots 

Related Readings

Many of the contents in this blog are copied directly from the original papers but re-ordered.

Problem Description


A distributed system consists of a finite set of processes and a finite set of channels. It is described by a labeled, directed graph in which the vertices represent processes and the edges represent channels.

Processes in a distributed system communicate by sending and receiving messages. A process can record its own state and the messages it sends and receives; it can record nothing else.

The state of a channel is the sequence of messages sent along the channel, excluding the messages received along the channel. The state of a channel \(c\) may be changed by the sending of a message along \(c\) (if \(c\) is directed away from \(p\)) or the receipt of a message along \(c\) (if \(c\) is directed towards \(p\)).A

An event of a distributed system can be characterized by a 5-tuple: \( \langle p, s, s', M, c \rangle \). It is interpreted as "the process \(p\) sends or receives a message \(M\) through the channel \(c\). The state of the process \(p\) right before the handling of the message is \(s\) and the state of the process \(p\) right after the handling of the message is \(s'\)."

A global state of a distributed system is a set of component process and channel states.


The problem is to devise algorithms by which processes record their own states and the states of communication channels so that the set of process and channel states recorded form a global system state. To limit the problem space, we assume that the delay experienced by a message in a channel is arbitrary but finite.

The challenge is that we cannot ensure that the states of all processes and channels will be recorded at the same instant because there is no global clock; however, we require that the recorded process and channel states form a "meaningful" global system state.

Main Idea

The main idea of the Global-State-Detection Algorithm is to use special markers to partition the messages flowing in the channels of the distributed system. Markers are used as a command to instruct the nodes that receive the marker to record its state and the state of incident channels.

Global-State-Detection Algorithm

Marker-Sending Rule for a Process \(p\) - For each channel \(c\), incident on, and directed away from \(p\), \(p\) sends one marker along \(c\) after \(p\) records its state and before p sends further messages aong \(c\).

Marker-Receiving Rule for a Process \(q\) - On receiving a marker along a channel \(c\):

#if q has not recorded its state:
    # q records its state;
    # q records the state c as the empty sequence
    # q records the state of c as the sequence of messages
    # during along c after q's state was recorded and before q received the marker along c.

Rule L1 - No marker remains forever in an incident input channel

Rule L2 - node record its state within finite time of initiation of the algorithm.

How to Collect Global State Records

A simple algorithm for collecting information in a system whose topology is strongly connected is for each process to send the information it records along all outgoing channels, and for each process receiving information for the first time to copy it and propagate it along all of its outgoing channels. All the recorded information will then get to all the processes in finite time, allowing all processes to determine the recorded global state.

Other Notes

In the paper Distributed Snapshots: Determining Global States of Distributed Systems, it mentioned another paper Termination Detection for Diffusing Computations in which Dijkstra and Scholten proposed an elegant algorithm to determine the termination of computation in a distributed system. The original paper is fun to read so we don't provide any notes here other than some key concepts

----- END -----

Welcome to join reddit self-learning community.
Send me a message Subscribe to blog updates

Want some fun stuff?