Introduction to ZooKeeper

Subscribe Send me a message home page tags

#zookeeper  #distributed system 

Table of Contents

Related Readings


ZooKeeper can be thought of as a coordination service in a distributed architecture. As a service, it has two parts: (1) client and (2) server. The distributed application use ZooKeeper clients to communicate with ZooKeeper server and thus coordinate with other application instances.


In this post, we will present the building blocks in ZooKeeper. Then, we will talk about the ordering in ZooKeeper and we will see there is a global order maintained by ZooKeeper. We complete this post by presenting some general notes.

Building Blocks


ZooKeeper uses znodes as the primitives for coordination. Znodes are organized hierarchically as a tree, much like a file system. A znode can be considered a tuple of (path, data, version). The data part is optional and is usually to store some metadata. Keep in mind that ZooKeeper is not a data store system and we should avoid using it for other purposes than coordination. The version is used to support conditional operation.

Znode has three modes

A persistent znode is not associated with a session. On the other hand, an ephemeral znode is associated with a session and is deleted if the session is deleted or expired.

Sequential znodes are used to create a group of znodes with unique names. A unique and increasing ID is appended to the group name.

Here is an example of znodes from the book:


One important note is that ZooKeeper does not allow partial writes or reads of the znode data. When setting the data of a znode or reading it, the content of the znode is replaced or read entirely.

Watch and Notification

A watch is a one-time trigger associated with a znode and a type of event. There are five event types:

None is used when the watched event is for a change of the state of the ZooKeeper session.

Client can register a watch on a znode, which means "send me a notification when the znode changes". The registered watch can only be used once (strictly speaking, a watch is triggered at most once) and if the client wants to continue monitoring the znode, it needs to register a new watch. Using a one-time trigger is a design decision because sending a notification for every znode change creates too much pressure on the system.

Note that each watch is associated with the session in which the client set it. If the session expires, pending watches are removed. Watches do, however, persist across connections to different servers.


Sessions constitute an important abstraction in ZooKeeper. Ordering guarantees, ephemeral znodes, and watches are tightly coupled to sessions. The session tracking mechanism is consequently very important ZooKeeper.

Before executing any request against a ZooKeeper cluster, a client must establish a session with the service. All operations a client submits to ZooKeeper are associated with a session. One functionality provided by session is order guarantee. Requests submitted in a session are executed in FIFO order. However, if a client has multiple concurrent session, FIFO ordering is not necessarily preserved across the sessions.

When a session ends for any reason, the ephemeral nodes created during that session disappear. One important parameter when creating a session is the session timeout, which is the amount of time the ZooKeeper service allows a session before declaring it expired.

The ZooKeeper client library takes care of reconnecting to the service and transfer the session. In this way, the details of the server we communicate with are hidden from the client.

Note that both client and ZooKeeper server can change the state of a session. However, we should treat the state reported by the server as the source of truth.

The diagram below shows the session states and transitions:


Session states and transitions


Multiop is similar to the transaction concept in a database system; it enables the execution of multiple ZooKeeper operations in a block atomically. The execution is atomic in the sense that either all operations in a multiop block succeed or all fail.

Order in ZooKeeper

ZooKeeper maintains a global order of messages and notifications. Every change to the state of ZooKeeper is totally ordered with respect to all other executed updates. One important guarantee of notifications is that they are delivered to a client before any other change is made to the same znode. (To build a mental model of this behavior, we could assume sending the notification is part of the processing of the message that updates the znode.)

Order vs Simultaneousness

There is a global order of messages and notifications. The state updates can be thought of as "commits" and servers in the ZooKeeper cluster have consensus on the order of commits. This setup is similar to multi paxos. In fact, ZooKeeper has leaders, followers, and observers, which roughly correspond to the distinguished proposer, accepters, and learners in the Paxos algorithm.

However, the existence of a global order doesn't mean there is a unique view of the state of the system nor a view is available to all servers simultaneously. This should not be surprising because "at the same time" does not exist in any distributed system.

Being not on the same page all the time is not a problem as long as there is no direct communication (hidden channel) between clients. For example, in the diagram below, if client 1 communicates with client 2 after it knows znode z is set to B, then from the perspective of client 2, it now has two different views on the value of the znode z.


As a best practice, we should avoid any hidden channels. If an application decides to leverage ZooKeeper, it doesn't want to handle the coordination in the first place so using hidden channels to coordinate is unnecessary.


Messages and requests can be out of order. This usually happens when there is a retry. For example, the figure below describes an out-of-order scenario:


Suppose we send a request R1 at t = 0 and at t = 1 the client disconnects temporarily and the disconnection event triggers a resend which is done asynchronously.

At t = 2, the connection resumes and in the "main thread" we send a second request R2. We further assume the network is not stable and immediately after R2 is sent, the client disconnects again. Eventually, the network becomes stable again and the request R1 is sent to the server successfully. From the client's perspective, it wants to send R1 followed by R2 and this is actually the order in the code of the "main thread". However, from the server's perspective, it sees R2 first.

General Notes

----- END -----

Welcome to join reddit self-learning community.
Send me a message Subscribe to blog updates

Want some fun stuff?