System Design - Facebook's Photo Storage

Subscribe Send me a message home page tags


Related Readings

General Notes

This is a reading note of Facebook's Haystack paper. Interestingly, this paper provides an easy-to-understand description of how CDN works.

CDN.png
When visiting a page the user’s browser first sends an HTTP request to a web server which is responsible for generating the markup for the browser to render. For each image the web server constructs a URL directing the browser to a location from which to download the data. For popular sites this URL often points to a CDN. If the CDN has the image cached then the CDN responds immediately with the data. Otherwise, the CDN examines the URL, which has enough information embedded to retrieve the photo from the site’s storage systems. The CDN then updates its cached data and sends the image to the user’s browser.

A typical URL that directs the browser to the CDN looks like the following:

http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>

The main reason why Facebook decided to develop the Haystack system is that they need to handle a special pattern of photo request traffic handled by large social media sites. This pattern is called long tail in the paper. It describes the fact that Facebook receives a large number of requests for less popular and often older content. This type of requests usually causes cache miss.

The Haystack persistency API is very simple. It only supports three APIs: (1) read (2) write and (3) delete. Data modification is not supported, which means the data is immutable most of the time. There is a flag in the data record indicating if a photo is deleted but strictly speaking this flag is a metadata.

When data is saved to disk, an append-only approach is used. Append-only approach plus compaction process is a standard method.

System Design Explained

The Haystack architecture consists of 3 components:

overall_architecture_of_Haystack.png

We could think of the Haystack as a micro service. The Haystack Directory is the application that provides the photo file location resolution service. The Haystack Store is the persistency layer, which handles the storage of the actual photo data. The Haystack Cache is the caching layer.

Haystack Directory

The main responsibility of the Haystack Directory is to provide the location of the requested photo. It serves the following four main functions

Note that machines in the Haystack Store have states. When the machine is added to the cluster, it is write-enabled which means it can accept upload requests. Over time, the capacity on the machine decreases and when a machine exhausts its capacity, it is marked as read-only.

Haystack Cache

This is a normal caching layer that sits in front of the persistency layer.

According to the paper, the Haystack Cache caches a photo only if two conditions are met

These two conditions can be considered optimization because Facebook photo requests have special patterns (i.e. long tail).

The Cache provides another benefit: it separates the storage engine and the CDN. In this way, the storage engine does not depend on external components anymore.

Haystack Store

Each Store machine manages multiple physical volumes. Each volume contains millions of photos. Physical volumes are grouped together to form logical volumes. The file system metadata is kept in memory. The most important information is the mapping from photo ID to (volume ID, file offset of photo).

To some extent, physical volumes are similar to replicas.

physical_layout_of_photo_data.png

physical layout of the photo data

When a machine starts up, it needs to re-build the metadata in memeory. It can read all data from the physical volumes but it's time-consuming. As an optimization, Haystack saves checkpoints as index files which are used during the recovery process.

There is a technical detail here. Saving checkpoints is done asynchronously, therefore there may be inconsistency between the index files and the physical volumes. The inconsistency can be easily addressed when the Store engine reads the actual data due to the fact that the storage system is mostly append-only and the records are in chronological order by default.

----- END -----

Welcome to join reddit self-learning community.
Send me a message Subscribe to blog updates

Want some fun stuff?

/static/shopping_demo.png