Process Mining - Learn Split and Join from Event Log
In the post Build Dependency Graph from Event Logs, we have seen how we can build a dependency graph from event logs. In this post, we will continue the study of process mining and learn how to extract split and join information.
We reuse the dependency information provided in the Flexible Heuristics Miner paper. The dependency table in the figure below is copied from the paper directly and the dependency graph on the right side is constructed using that table.
What is split and join?
Suppose we want to process a customer payment. For the sake of simplification, we assume there are only four steps:
- a: receive payment request
- b: validate user ID
- c: validate user payment information
- d: process the payment
In the code we may have something like
def paymentRequestHandler(): if isUserIdValid() and isPaymentInformationValid(): processPayment()
As we always perform validation before processing the payment, this dependency will be reflected in the event logs and the reconstructed dependency graph.
What is missing in the dependency graph however is the condition. Conditions are implemented in the form of if-else block in the code and they are sources of system complexity. In an ideal case, we want to add the following information to the dependency graph above
- user ID is always validated (b)
- the payment is processed (d) only if both user Id (b) and payment information (c) are valid.
There is a way to visualize this information using a dependency graph. For example:
As the above graph shows, there could be a split after activity \(a\) and in order to execute activity \(d\) we must have both \(b\) and \(c\) already executed. The latter condition is a join.
The figure below illustrates the process of learning split and join. The inputs of the learning algorithm are
- event logs
- dependency graph
The outputs of the learning algorithm are the counts of input and output combination of activities.The table below in the Flexible Heuristics Miner paper gives an example of such outputs:
To calculate the counts of input and output combination, we need to go through the event logs case by case. Recall that each case provides a trace. For instance, suppose we have a trace ABDCIFIJEGHK in the log. We will scan the trace from the right to the left and identify the nearest activator of activities. This strategy is called Nearest Candidate Strategy, which is a greedy algorithm.
The result of this scan is as follows
In this particular example, the scan of the trace indicates that there is an AND-split for activity A:
Note that we are "lucky" with this trace because after the initial scan, there is no leftover, i.e. all activities in the trace have an output arrow. For any activity that doesn't have an outgoing arrow, we will do a second scan from left to right and apply the Nearest Candidate Strategy again. This time, we will need to identify the nearest impacted activity of a given activity.
Some thoughts on the heuristics. Clearly, the Nearest Candidate Strategy is a greedy algorithm that does not guarantee its correctness. However, this assumption is relatively reasonable. This is especially true for application logs if the code that generates the logs follows best practice such as grouping logically related code together.
----- END -----
©2019 - 2023 all rights reserved