Table of Contents
- Related Readings
- Introduction
- Building Blocks
- Znode
- Watch and Notification
- Sessions
- Multiop
- Order in ZooKeeper
- Order vs Simultaneousness
- Out-of-Order
- General Notes
Related Readings
Introduction
ZooKeeper can be thought of as a coordination service in a distributed architecture. As a service, it has two parts: (1) client and (2) server. The distributed application use ZooKeeper clients to communicate with ZooKeeper server and thus coordinate with other application instances.

In this post, we will present the building blocks in ZooKeeper. Then, we will talk about the ordering in ZooKeeper and we will see there is a global order maintained by ZooKeeper. We complete this post by presenting some general notes.
Building Blocks
Znode
ZooKeeper uses znodes as the primitives for coordination. Znodes are organized hierarchically as a tree, much like a file system. A znode can be considered a tuple of (path, data, version). The data part is optional and is usually to store some metadata. Keep in mind that ZooKeeper is not a data store system and we should avoid using it for other purposes than coordination. The version is used to support conditional operation.
Znode has three modes
- persistent mode
- ephemeral mode
- sequential znodes
A persistent znode is not associated with a session. On the other hand, an ephemeral znode is associated with a session and is deleted if the session is deleted or expired.
Sequential znodes are used to create a group of znodes with unique names. A unique and increasing ID is appended to the group name.
Here is an example of znodes from the book:

One important note is that ZooKeeper does not allow partial writes or reads of the znode data. When setting the data of a znode or reading it, the content of the znode is replaced or read entirely.
Watch and Notification
A watch is a one-time trigger associated with a znode and a type of event. There are five event types:
- NodeCreated
- NodeDeleted
- NodeDataChanged
- NodeChildrenChanged
- None
None is used when the watched event is for a change of the state of the ZooKeeper session.
Client can register a watch on a znode, which means "send me a notification when the znode changes". The registered watch can only be used once (strictly speaking, a watch is triggered at most once) and if the client wants to continue monitoring the znode, it needs to register a new watch. Using a one-time trigger is a design decision because sending a notification for every znode change creates too much pressure on the system.
Note that each watch is associated with the session in which the client set it. If the session expires, pending watches are removed. Watches do, however, persist across connections to different servers.
Sessions
Sessions constitute an important abstraction in ZooKeeper. Ordering guarantees, ephemeral znodes, and watches are tightly coupled to sessions. The session tracking mechanism is consequently very important ZooKeeper.
Before executing any request against a ZooKeeper cluster, a client must establish a session with the service. All operations a client submits to ZooKeeper are associated with a session. One functionality provided by session is order guarantee. Requests submitted in a session are executed in FIFO order. However, if a client has multiple concurrent session, FIFO ordering is not necessarily preserved across the sessions.
When a session ends for any reason, the ephemeral nodes created during that session disappear. One important parameter when creating a session is the session timeout, which is the amount of time the ZooKeeper service allows a session before declaring it expired.
The ZooKeeper client library takes care of reconnecting to the service and transfer the session. In this way, the details of the server we communicate with are hidden from the client.
Note that both client and ZooKeeper server can change the state of a session. However, we should treat the state reported by the server as the source of truth.
The diagram below shows the session states and transitions:

Session states and transitions
Multiop
Multiop is similar to the transaction concept in a database system; it enables the execution of multiple ZooKeeper operations in a block atomically. The execution is atomic in the sense that either all operations in a multiop block succeed or all fail.
Order in ZooKeeper
ZooKeeper maintains a global order of messages and notifications. Every change to the state of ZooKeeper is totally ordered with respect to all other executed updates. One important guarantee of notifications is that they are delivered to a client before any other change is made to the same znode. (To build a mental model of this behavior, we could assume sending the notification is part of the processing of the message that updates the znode.)
Order vs Simultaneousness
There is a global order of messages and notifications. The state updates can be thought of as "commits" and servers in the ZooKeeper cluster have consensus on the order of commits. This setup is similar to multi paxos. In fact, ZooKeeper has leaders, followers, and observers, which roughly correspond to the distinguished proposer, accepters, and learners in the Paxos algorithm.
However, the existence of a global order doesn't mean there is a unique view of the state of the system nor a view is available to all servers simultaneously. This should not be surprising because "at the same time" does not exist in any distributed system.
Being not on the same page all the time is not a problem as long as there is no direct communication (hidden channel) between clients. For example, in the diagram below, if client 1 communicates with client 2 after it knows znode z
is set to B, then from the perspective of client 2, it now has two different views on the value of the znode z
.

As a best practice, we should avoid any hidden channels. If an application decides to leverage ZooKeeper, it doesn't want to handle the coordination in the first place so using hidden channels to coordinate is unnecessary.
Out-of-Order
Messages and requests can be out of order. This usually happens when there is a retry. For example, the figure below describes an out-of-order scenario:

Suppose we send a request R1 at t = 0 and at t = 1 the client disconnects temporarily and the disconnection event triggers a resend which is done asynchronously.
At t = 2, the connection resumes and in the "main thread" we send a second request R2. We further assume the network is not stable and immediately after R2 is sent, the client disconnects again. Eventually, the network becomes stable again and the request R1 is sent to the server successfully. From the client's perspective, it wants to send R1 followed by R2 and this is actually the order in the code of the "main thread". However, from the server's perspective, it sees R2 first.
General Notes
- Data (e.g. znodes) managed by ZooKeeper is called coordination data. It's a good practice to separate such coordination data from application data.
- Whenever the (distributed) system detects inconsistencies, it should try to reconcile the difference immediately.
- In a distributed system, it's hard to prove a node or a client has crashed. Therefore, when we suspect that a client has crashed, we actually need to react by assuming that it could just be slow, and that it may execute some other actions in the future.
- To preserve order, there is a single callback thread and responses are processed in the order they are received. The callback is involved when we use asynchronous APIs. Because there is only one single callback thread, we should avoid putting any potentially blocking operations in the callback function. Otherwise, there is a risk that a problematic callback function could prevent the client from processing other responses from the server.
----- END -----
©2019 - 2022 all rights reserved