# Log Analysis - Preliminary Reading

• Introduction
• Log Analysis Workflow
• Data Types And Feature Matrix Construction
• Event Count Approach
• Natural Language Processing Approach
• Challenges
• Thoughts

### Introduction

Debugging is an essential part of being a software engineer. However, industry-level applications are often complex, and it is almost impossible for a single engineer to understand every aspect of them. To some extent this is "by design": to boost productivity, software engineering teams usually adopt modular programming, which emphasizes separating functionality. The idea is that small teams can then be dedicated to different sub-components of the system. This arrangement improves productivity because the code changes made by each small team stay localized, which accelerates development and makes testing easier.

The downside of this setup is that software engineers only have a partial view of the system. This is particularly problematic for debugging, for two reasons:

• Tricky bugs usually involve multiple sub-components in the system. Moreover, the cause and the effect may not be "close" to each other.
• A partial view means software engineers may not know the expected behavior of sub-components owned by other teams. If they don't know the expected behavior, they cannot recognize abnormal behavior either, which makes debugging very hard. Often the only way to move the investigation forward is to get other teams involved and ask them to explain everything. This can be very time-consuming.

So the question is: is it possible to debug and perform root cause analysis without having a complete understanding of the system? If we think about it, this is just another use case of machine learning techniques. Computers don't need to understand objects to classify images and similarly, they don't need to fully understand the business logic to detect issues.

This post is a reading note of several papers related to log analysis. We will present the key ideas described in those papers and highlight some of the challenges for automatic log analysis.

### Log Analysis Workflow

As pointed out in one of the papers, having humans read application logs is not scalable. The application logs of complex systems are huge. They usually contain lots of low-level information that may not help with debugging. Most of the time, the noisy information in the logs is a distraction.

On the other hand, a large volume of logs provides an opportunity to apply data-driven analysis. We are actually in a good position here. Although the logs are noisy, the underlying system is in a "normal" state most of the time, which means we can expect to see patterns in the data. This is not the case for systems such as the stock market, where system behavior (i.e. price movement) can look essentially random.

By definition, application logs are human-readable records of events. This format is not desirable for data analysis. As we will see later, one of the major challenges is how we can transform application logs into a numeric representation.

In general, log analysis consists of four steps:

• Log collection
• Log parsing
• Feature Extraction
• Analysis (e.g. Anomaly Detection)

Log collection and parsing are not our concern in this post. We will assume well-formatted text application logs are available and focus on the feature extraction and analysis steps.

### Data Types And Feature Matrix Construction

In this section, we will talk about feature extraction.

The raw inputs of log analysis are, of course, application log files. After parsing the logs into separate events, we need to encode those events as numerical feature vectors so that machine learning models or other algorithms can be applied. This numerical representation should capture the most important information in the log files and filter out noisy records.

In general, there are two types of data included in log files:

• numerical data
• event data

Numerical data is generally related to system state. For example, CPU usage, available disk space, and request rate are all numerical. Numerical data can be used in machine learning models with minimal parsing effort.

Event data are records in log files that describe what has happened. For example, we may have the following lines in the application logs:

t = 0, received transaction request with ID = 10
t = 3, start process transaction(id=10)
t = 5, validate user input in transaction(id=10)
t = 12, processed transaction(id=10) successfully
t = 14, send reply to requestor of transaction(id=10)
t = 30, received transaction request with ID = 11
...


Each line in the above log represents an event, and most of the time events can be grouped by some sort of request ID or case ID. This type of information cannot be used directly by machine learning models because it is not numeric.
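As a rough illustration of grouping events by ID, here is a minimal sketch in Python. The regular expressions, the function names (`parse_line`, `group_by_transaction`), and the `id=<ID>` masking convention are assumptions made for this example and do not come from the papers; the sample lines are modeled on the log above.

```python
import re
from collections import defaultdict

# Sample lines modeled on the example log in this post.
LOG_LINES = [
    "t = 0, received transaction request with ID = 10",
    "t = 3, start process transaction(id=10)",
    "t = 5, validate user input in transaction(id=10)",
    "t = 12, processed transaction(id=10) successfully",
    "t = 14, send reply to requestor of transaction(id=10)",
    "t = 30, received transaction request with ID = 11",
]

# Assumed line layout: "t = <time>, <message>".
TIME_RE = re.compile(r"^t = (\d+), (.*)$")
# Assumed ID layout: "id=10" or "ID = 10".
ID_RE = re.compile(r"id\s*=\s*(\d+)", re.IGNORECASE)


def parse_line(line):
    """Return (timestamp, transaction_id, template) for one log line."""
    m = TIME_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    timestamp = int(m.group(1))
    message = m.group(2)
    id_match = ID_RE.search(message)
    transaction_id = int(id_match.group(1)) if id_match else None
    # Mask the ID so that lines differing only by ID share one template.
    template = ID_RE.sub("id=<ID>", message)
    return timestamp, transaction_id, template


def group_by_transaction(lines):
    """Map each transaction ID to its list of (timestamp, template) events."""
    groups = defaultdict(list)
    for line in lines:
        timestamp, transaction_id, template = parse_line(line)
        groups[transaction_id].append((timestamp, template))
    return dict(groups)


events_by_id = group_by_transaction(LOG_LINES)
```

Masking the ID turns each line into an event type (template), which is the unit that the feature extraction methods below operate on.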

Two methods exist to transform event data into a numerical feature matrix:

• Event count
• Apply natural language processing techniques

#### Event Count Approach

The event count approach consists of two steps:

• define time window
• count events in the time window

There are three ways to define the time window:

• Fixed window
• Sliding window
• Session window

The figure below uses the fixed window method:

Notice that the event count approach loses the order of events within a window.
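To make the windowing concrete, here is a minimal sketch that builds an event count matrix with fixed and sliding windows. The event names, the window size of 10 time units, and the step of 5 are hypothetical choices for illustration. Session windows are not shown; under the usual reading they would group events by request ID instead of by time.

```python
from collections import Counter

# Hypothetical (timestamp, event_type) pairs; the names are illustrative.
EVENTS = [
    (0, "received"), (3, "start"), (5, "validate"),
    (12, "processed"), (14, "reply"), (30, "received"),
]
# Column order of the feature matrix.
EVENT_TYPES = ["received", "start", "validate", "processed", "reply"]


def count_vector(events, start, end):
    """Count events with start <= timestamp < end, one column per event type."""
    counts = Counter(name for t, name in events if start <= t < end)
    return [counts[name] for name in EVENT_TYPES]


def fixed_windows(events, size, horizon):
    """Non-overlapping windows [0, size), [size, 2*size), ... up to horizon."""
    return [count_vector(events, s, s + size) for s in range(0, horizon, size)]


def sliding_windows(events, size, step, horizon):
    """Overlapping windows of length `size` that advance by `step`."""
    return [count_vector(events, s, s + size)
            for s in range(0, horizon - size + 1, step)]


fixed = fixed_windows(EVENTS, size=10, horizon=40)
sliding = sliding_windows(EVENTS, size=10, step=5, horizon=40)
```

Each row of `fixed` or `sliding` is one window and each column is one event type. Two windows containing the same events in a different order produce identical rows, which is the information loss noted above.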

#### Natural Language Processing Approach

In its crudest form, we treat log files as plain (English) text and apply standard natural language processing techniques. For example, we can process the log files and generate a distributed representation of words so that each word is represented by a numerical vector. Each log line is then simply considered a collection of words (word vectors) and can be represented by the barycenter (the mean) of the word vectors it contains.
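The barycenter idea can be sketched in a few lines. The word vectors below are hand-written, hypothetical 3-dimensional values used only for illustration; in practice they would come from an embedding model trained on the logs. The function name `line_vector` is also an assumption.

```python
# Hypothetical 3-dimensional word vectors, hand-written for illustration.
WORD_VECTORS = {
    "received": [1.0, 0.0, 0.0],
    "transaction": [0.0, 1.0, 0.0],
    "request": [0.0, 0.0, 1.0],
    "processed": [1.0, 1.0, 0.0],
}


def line_vector(line, word_vectors):
    """Average the vectors of the known words in a log line (the barycenter)."""
    vectors = [word_vectors[w] for w in line.lower().split() if w in word_vectors]
    if not vectors:
        return None  # no known word, so the line has no representation
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


vec = line_vector("received transaction request", WORD_VECTORS)
shuffled = line_vector("request transaction received", WORD_VECTORS)
```

Because the mean ignores word order, `vec` and `shuffled` are the same vector. Unknown words are skipped here, which is one possible design choice among several.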

There are two types of tasks when performing log analysis:

• anomaly detection
• root cause analysis

Both of the tasks can be framed as a classification problem. For anomaly detection, the output is one of the two system states:

1. normal state and
2. abnormal state

For root cause analysis, there will be more classes: normal state, root cause 1, root cause 2, ..., root cause K. Note that the abnormal states in this case are just labels attached to data points, so we need labeled data and a supervised learning algorithm.
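As a toy illustration of root cause analysis framed as multi-class classification, the sketch below uses a nearest-centroid classifier. This algorithm is swapped in here only because it fits in a few lines; the papers do not prescribe it. The feature columns, the training data, and the labels are all hypothetical.

```python
import math

# Hypothetical labeled event-count vectors: [errors, retries, timeouts].
TRAINING = [
    ([0, 0, 0], "normal"),
    ([1, 0, 0], "normal"),
    ([9, 1, 0], "root cause 1"),
    ([8, 2, 1], "root cause 1"),
    ([0, 1, 9], "root cause 2"),
    ([1, 0, 8], "root cause 2"),
]


def fit_centroids(samples):
    """Compute the mean feature vector of each label."""
    sums, counts = {}, {}
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = fit_centroids(TRAINING)
prediction = predict(centroids, [7, 1, 0])
```

Anomaly detection is the special case with only two labels, normal and abnormal.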

Anomaly detection can also be done using regression models. For example, we can monitor system performance and resource usage. In such a way, the system state is represented by a numeric vector and this is our $$Y$$ in the model. Then we can define an abnormal state by specifying thresholds. For example, if the CPU usage of any host is above 90%, then the system state is considered abnormal.
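The thresholding step can be sketched as follows. Only the threshold rule is shown; the regression model that would predict the state vector is omitted. The 90% value comes from the example above, while the host names and readings are hypothetical.

```python
CPU_THRESHOLD = 90.0  # percent, taken from the example in the text


def system_state(cpu_usage_by_host, threshold=CPU_THRESHOLD):
    """Return 'abnormal' if any host exceeds the threshold, else 'normal'."""
    if any(usage > threshold for usage in cpu_usage_by_host.values()):
        return "abnormal"
    return "normal"


# Hypothetical host names and CPU readings (percent).
state_ok = system_state({"host-a": 42.0, "host-b": 71.5})
state_bad = system_state({"host-a": 42.0, "host-b": 93.2})
```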

### Challenges

The main challenge is to figure out a way to transform log files into a numerical form that captures the essential information of events. Ideally, the transformation should filter out noisy data and keep the most important information. It should "understand" the relative importance of different events as well.

Note that both the event count approach and the natural language processing approach, in the simple forms described above, lose sequence information. For example, the fact that event A happens before event B is not represented in either approach.

Another challenge is caused by concurrent events. For example, case 1 and case 2 in the figure below can be identical sequences of events, yet they may produce different log lines due to concurrency. This is a real problem because the number of possible event sequences grows exponentially.
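To get a feel for how quickly the number of orderings grows, the sketch below enumerates every way two concurrent requests can interleave in the log while each request keeps its own internal order. The event names are hypothetical. For two requests with m and n events there are C(m+n, n) interleavings.

```python
from math import comb


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    # Either the next logged event comes from a, or it comes from b.
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Hypothetical event names for two concurrent requests.
request_1 = ["r1:start", "r1:validate", "r1:reply"]
request_2 = ["r2:start", "r2:validate", "r2:reply"]

orderings = list(interleavings(request_1, request_2))
expected_count = comb(len(request_1) + len(request_2), len(request_1))
```

Two requests of three events each already admit 20 distinct log orderings, and the count keeps growing rapidly as requests get longer or more requests overlap.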

Similar to natural language processing tasks, we also need to introduce "memory" to the model. The fact that we fail to process request X doesn't necessarily mean there is anything wrong with request X itself. The failure might be caused by something that happened before request X. Sometimes the relative order matters. Suppose we have two consecutive events, event 1 and event 2, and event 2 fails. It is possible that event 2 itself is valid and only causes problems when it happens immediately after event 1. Because there is no concept of memory in the event count approach, it cannot discover such a relationship, and it may appear to the model that event 2 fails randomly.
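One very small step toward "memory" is to count consecutive pairs of events (bigrams) instead of single events. This is an illustrative technique chosen for this sketch, not something the papers prescribe, and the event names are hypothetical.

```python
from collections import Counter


def unigram_counts(sequence):
    """Plain event counts: the order of events is discarded."""
    return Counter(sequence)


def bigram_counts(sequence):
    """Count consecutive pairs, so each feature remembers the previous event."""
    return Counter(zip(sequence, sequence[1:]))


# Two hypothetical sequences with the same events in a different order.
seq_a = ["event1", "event2", "event3"]
seq_b = ["event2", "event1", "event3"]

same_unigrams = unigram_counts(seq_a) == unigram_counts(seq_b)
same_bigrams = bigram_counts(seq_a) == bigram_counts(seq_b)
follows_event1_a = bigram_counts(seq_a)[("event1", "event2")]
follows_event1_b = bigram_counts(seq_b)[("event1", "event2")]
```

The plain counts cannot tell the two sequences apart, while the pair counts record that event 2 immediately follows event 1 only in `seq_a`. Bigrams only look one event back, so longer-range dependencies would still need a model with real memory.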

### Thoughts

The event count approach is a good starting point. Once we construct the event count matrix, we can apply standard classification algorithms to get some initial results. However, it misses some valuable information, such as

• order of events
• start time and completion time of events

Log analysis is similar to natural language processing tasks in many ways, so applying natural language processing techniques seems to be the right direction. In some ways, log analysis can be considered a special case of natural language processing, although it has its own challenges, such as handling concurrent events.
