# Introduction to Distributed System Tracing

Part of the content of this post is copied directly from the documents listed in the Related Reading section.

### Introduction

Distributed tracing is a type of correlated logging that helps you gain visibility into the operation of a distributed software system for use cases such as performance profiling, debugging in production, and root-cause analysis of failures or other incidents. Debugging distributed systems is challenging. As the application becomes more distributed, the coherence of failures begins to decrease. That is to say, the distance between cause and effect increases. Therefore, tools that provide local information, such as logging, are not enough to handle this complicated situation.

The whole idea is to get better observability. Formally, a system is observable if its internal states can be inferred from its external outputs. Observability became necessary at organizations running large systems: the complexity of their systems grew so large, while the number of people responsible for managing them stayed relatively small, that they needed a way to simplify the problem space.

There are two parts in "distributed tracing". The first one is "distributed". Roughly speaking, distributed means "all over the place". The second one is "tracing". Let's compare tracing with logs. Logs provide extremely fine-grained detail on a given service but have no built-in way to provide that detail in the context of a request. In this sense, logs only provide local information. The following code shows an example:

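    import logging

    log = logging.getLogger(__name__)

    def doSomethingFirstAndGetIntermediateResult(request):
        # Do something with the request (placeholder logic).
        result = {"payload": request}
        return result

    def applyAnotherFunction(result):
        # This log line says nothing about which request it belongs to.
        log.info("process intermediate result.")
        response = {"status": "ok", "data": result}
        return response

    def handleRequest(request):
        result = doSomethingFirstAndGetIntermediateResult(request)
        response = applyAnotherFunction(result)
        return response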


What the logging lacks is the context that links the request, the intermediate result, and the response. It would be very helpful for debugging and performance analysis if applyAnotherFunction knew that the intermediate result was generated while handling a specific request.

Distributed tracing is all about context propagation: we need each service to know about the caller’s trace, and each service we call out to needs to know that it’s included in a trace as well.
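To make context propagation concrete, here is a minimal sketch in Python. It assumes the trace id travels between services in a request header; the header name `x-trace-id` and the helpers `inject`, `extract`, and `TraceIdFilter` are hypothetical names chosen for illustration, not the API of any particular tracing library.

```python
import contextvars
import logging
import uuid

# Holds the trace id of the request currently being handled (hypothetical design).
_current_trace_id = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every log record."""

    def filter(self, record):
        record.trace_id = _current_trace_id.get()
        return True


log = logging.getLogger("tracing_demo")
log.setLevel(logging.INFO)
log.addFilter(TraceIdFilter())
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter("trace=%(trace_id)s %(message)s"))
log.addHandler(_handler)


def extract(headers):
    """Adopt the caller's trace id, or start a new trace if there is none."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    _current_trace_id.set(trace_id)
    return trace_id


def inject(headers):
    """Copy the current trace id into the headers of an outgoing call."""
    trace_id = _current_trace_id.get()
    if trace_id is not None:
        headers["x-trace-id"] = trace_id
    return headers


def apply_another_function(result):
    # The log line now carries the trace id without the function knowing about it.
    log.info("process intermediate result.")
    return {"data": result}


def handle_request(headers, body):
    extract(headers)                  # learn about the caller's trace
    result = body.upper()             # placeholder for real work
    response = apply_another_function(result)
    downstream_headers = inject({})   # tell the next service about the trace
    return response, downstream_headers


response, downstream_headers = handle_request({"x-trace-id": "abc123"}, "hello")
```

The point of the design is that `apply_another_function` is unchanged from the logging example: the context is carried implicitly, and only the code at the service boundary (`extract` on the way in, `inject` on the way out) needs to know about tracing.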

### Key Concepts

In this section, we take a look at Google's Dapper and see how it models distributed context. Dapper models traces using trees, spans, and annotations. In a Dapper trace tree, the tree nodes are basic units of work which are referred to as spans. The edges indicate a causal relationship between a span and its parent span. Dapper records a human-readable span name for each span, as well as a span id and parent id. Spans created without a parent id are known as root spans. Therefore, we have the following key components in a distributed tracing system:

• trace tree
• trace id
• span
• span id
• parent span id
• span name
• timestamped events.
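The components listed above can be sketched as a small data structure. This is an illustrative model only, not Dapper's actual schema: the field names, the millisecond timestamps, and the helpers `build_tree` and `render` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks a root span
    name: str                 # human-readable span name
    start: float              # start time in ms (assumed unit)
    end: float                # end time in ms (assumed unit)
    # Timestamped events attached to the span: (time, message) pairs.
    annotations: List[Tuple[float, str]] = field(default_factory=list)


def build_tree(spans):
    """Split spans into root spans and a parent-id -> children index."""
    roots: List[Span] = []
    children: Dict[str, List[Span]] = {}
    for span in spans:
        if span.parent_id is None:
            roots.append(span)
        else:
            children.setdefault(span.parent_id, []).append(span)
    return roots, children


def render(span, children, depth=0):
    """Return the trace tree under `span` as indented text lines."""
    lines = ["  " * depth + f"{span.name} ({span.end - span.start:.0f} ms)"]
    for child in sorted(children.get(span.span_id, []), key=lambda s: s.start):
        lines.extend(render(child, children, depth + 1))
    return lines


spans = [
    Span("t1", "1", None, "frontend.Request", 0.0, 100.0),
    Span("t1", "3", "1", "backend.Query", 25.0, 90.0),
    Span("t1", "2", "1", "auth.Check", 5.0, 20.0),
    Span("t1", "4", "3", "db.Read", 30.0, 80.0, [(31.0, "cache miss")]),
]
roots, children = build_tree(spans)
print("\n".join(render(roots[0], children)))
```

Because every span carries its own id and its parent's id, the tree can be rebuilt from spans that arrive in any order and from any machine.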

Note: although a span is the basic unit of work, conceptually it is nothing more than a grouped pair of start and end events. It is useful for representing request/response communication such as an RPC, or work submitted to a thread pool or other executor. However, it cannot handle message-based communication well. The problem with message-based communication and other purely event-driven systems is that the communication is one-way. There is no response, so we don't have a span. Instead, we only have "points" in a timeline.

### Design Goals and Challenges

At Google's scale, every architecture design is challenging. However, we don't need to deal with thousands of servers on a daily basis, so scalability is not really our concern in this post.

One of the biggest challenges, which is also mentioned in the Dapper paper, is application-level transparency.

> programmers should not need to be aware of the tracing system. [...] True application-level transparency, possibly our most challenging design goal, was achieved by restricting Dapper's core tracing instrumentation to a small corpus of ubiquitous threading, control flow, and RPC library code.

Having a shared common library is the key; without one, manual refactoring becomes necessary. That may not be that bad, because programmers add application-specific textual annotations by hand anyway.

The second challenge is to minimize overhead. We don't need to collect all trace data because systems perform well most of the time. To reduce the load, we can sample the trace data. A natural question is how to control the sampling rate. For a large monitoring system, we should avoid any manual tuning because it's not maintainable. The right way to do it is adaptive sampling. The idea is that the system should be able to adjust the sampling rate itself. The Dapper paper doesn't provide any details though.
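Since the paper gives no details, the following is only one plausible way to implement adaptive sampling, not Dapper's algorithm. The sketch assumes we want to keep roughly a fixed number of traces per time window and retune the sampling probability at the end of each window; the class name `AdaptiveSampler` and all of its parameters are hypothetical.

```python
import random


class AdaptiveSampler:
    """Keep roughly `target_per_window` traces per window by retuning the rate."""

    def __init__(self, target_per_window, initial_rate=1.0, min_rate=0.0001, rng=None):
        self.target = target_per_window
        self.rate = initial_rate      # current sampling probability
        self.min_rate = min_rate      # never stop sampling entirely
        self.rng = rng or random.Random()
        self.seen = 0                 # requests observed in the current window

    def should_sample(self):
        """Decide whether to trace the current request."""
        self.seen += 1
        return self.rng.random() < self.rate

    def end_window(self):
        """Call once per window (for example, every second) to retune the rate."""
        if self.seen > 0:
            ideal = self.target / self.seen
            self.rate = min(1.0, max(self.min_rate, ideal))
        self.seen = 0
        return self.rate


sampler = AdaptiveSampler(target_per_window=10, rng=random.Random(0))

# A busy window: 1000 requests arrive, so the rate drops to 10 / 1000.
busy_sampled = sum(sampler.should_sample() for _ in range(1000))
busy_rate = sampler.end_window()

# A quiet window: only 5 requests arrive, so the rate climbs back to 1.0.
for _ in range(5):
    sampler.should_sample()
quiet_rate = sampler.end_window()
```

A busy service ends up with a low sampling probability and a quiet one with a high probability, without anyone setting either value by hand. A production version would probably smooth the rate across windows rather than jump directly to the new value.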

The third challenge is visualization. After we collect the trace data, we need to figure out a way to present it.

### Example of Architecture

Here we present the high-level architecture mentioned in the Dapper paper. The tracing information is first written to local files on production servers. Each production server has a running Dapper daemon which sends the information to Dapper collectors. Dapper collectors are responsible for storing the information in a central repository. Note that the trace data is sparse if presented in a table format. Therefore, the data repository needs to have good support for sparse data.
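The pipeline above can be imitated in a few lines of Python. This is a toy sketch under stated assumptions, not Dapper's implementation: spans are written as JSON lines to a local file, the "daemon" is a function that reads the file once, and the "central repository" is an in-memory dictionary with one row per trace and one column per span. All names here are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


def write_span(log_path, span):
    """Production server: append one span record to a local log file."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(span) + "\n")


class Collector:
    """Central repository: rows keyed by trace id, columns keyed by span id."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def receive(self, span):
        self.rows[span["trace_id"]][span["span_id"]] = span


def run_daemon_once(log_path, collector):
    """Daemon: read the local log file and forward every span to the collector."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                collector.receive(json.loads(line))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spans.log")
    write_span(path, {"trace_id": "t1", "span_id": "1", "parent_id": None, "name": "frontend.Request"})
    write_span(path, {"trace_id": "t1", "span_id": "2", "parent_id": "1", "name": "backend.Query"})
    write_span(path, {"trace_id": "t2", "span_id": "9", "parent_id": None, "name": "frontend.Request"})

    collector = Collector()
    run_daemon_once(path, collector)
```

The nested dictionary shows why the table is sparse: trace `t1` only has columns `1` and `2`, trace `t2` only has column `9`, and no row stores a cell for a span it does not contain.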
