# Process Mining - Build Dependency Graph from Event Logs

Process mining is a multi-step process. In general, the first step is to generate some descriptive statistics of events in the log such as number of unique events, frequency of events, etc. Then we usually need to build a dependency graph. The dependency graph captures the dependency or causality between events and it's one of the fundamental components in events visualization.

In the book Process Mining: Data Science in Action, the author proposes the following two metrics to characterize dependency between two events:

Definition: Direct Follow.

The direct follow between event $$a$$ and $$b$$, denoted as $$DF(a,b)$$, is the count of event $$a$$ directly followed by event $$b$$.

Definition: Dependency Measure.

The dependency measure is defined as follows

$$DM(a,b) = \begin{cases} \frac{DF(a,b) - DF(b,a)} {DF(a,b) + DF(b,a) + 1} & \;\; \textrm{if} \;a \neq b\\ \frac {DF(a,a)} {DF(a,a) + 1} & \;\; \textrm{if} \;a = b\\ \end{cases}$$

We notice that $$DM(a,b)$$ has value between -1 and 1. If the value is close to 1, it indicates event $$a$$ causes event $$b$$. If the value is close to zero, then it suggests there is no clear causality between event $$a$$ and $$b$$. Use this dependency measure together with a threshold, we will be able to selectively construct a dependency graph. For example, we could draw an edge from any event pair $$a$$ to $$b$$ if the dependency measure value is larger than 0.3.

We can apply this simple algorithm to our sample event log and it gives us the following result: Note that the dependency measure implicitly assumes that there is not circular dependency because only one of $$DM(a,b)$$ and $$DM(b,a)$$ can be positive. If the dependency measure is close to zero, then according to the interpretation, there is no causality relationship between the two events in question. Depending on the chosen threshold, we may not draw an edge between the two nodes.

This behavior might not be desirable if we know in advance that cyclical pattern exists. For example, in application logs cyclical patterns are quite common if there is retry mechanism in place.

We'd like to point out that this limitation is sort of a local property between two events. As we can see in the discovered model on the right in the figure, we do have cycles. For example, the discovered model allows b -> d -> f -> b.

Finally, we'd like to point out as well that the dependency graph presented in this post has a OR semantic by default. For example, we have edge from $$a$$ to $$b$$ and from $$a$$ to $$c$$. This means if the event $$a$$ occurs, we would see it is followed by either an event $$b$$ or an event $$c$$.

----- END -----

Welcome to join reddit self-learning community.

Want some fun stuff?  