Description
SOURCE
and TARGET
functions provide a way to combine values from two different rows of the activity table
into the same row, e.g. for calculating the throughput time between consecutive events inside a case.
In process mining applications it is often required to relate an event to another event which directly or eventually follows. For instance, this is required to compute the throughput time between two events by calculating the difference between the corresponding timestamps. Due to the relational data model, the timestamp values to subtract are stored in different rows. However, the operators (e.g. arithmetic operations) usually can only combine values from the same row. Therefore, we need a way to combine values from two different rows into the same row for performing such computations.
To overcome this issue, PQL relies on the SOURCE
and TARGET
operators.
Syntax
Adds the values from a column, which correspond to a source activity, to a temporary table based on the optional filter column and the optional edge configuration:
SOURCE ( activity_table.column [, activity_table.filter_column ] [, edge_configuration ] )
Adds the values from a column, which correspond to a target activity, to a temporary table based on the optional filter column and the optional edge configuration:
TARGET ( activity_table.column [, activity_table.filter_column ] [, edge_configuration ] )
- activity_table.column: Column of the activity table. Its values are mapped to the referred events and returned as result column. The result column is stored in a temporary result table which can be joined with the case table.
- activity_table.filter_column: Optional filter column to skip certain events. Events with a NULL value in the related entry of the filter column are ignored. Usually, the filter column is created using REMAP_VALUES. See below for detailed explanation.
- edge_configuration: Describes which edges between the activities of a case are considered. Must be one of:
- ANY_OCCURRENCE[] TO ANY_OCCURRENCE[] (DEFAULT)
- FIRST_OCCURRENCE[] TO ANY_OCCURRENCE[]
- FIRST_OCCURRENCE[] TO ANY_OCCURRENCE_WITH_SELF[]
- ANY_OCCURRENCE[] TO LAST_OCCURRENCE[]
- FIRST_OCCURRENCE[] TO LAST_OCCURRENCE[]
The following example illustrates how SOURCE
and TARGET
can be used to compute the throughput time
between an event and its direct successor. While SOURCE
always refers to the actual event, TARGET
refers to its following
event inside the case. Such a connection is called an edge.
Consequently, SOURCE
and TARGET
can be used to combine an event with its following event in the same row of a table.
Both operators accept a column of the activity table as input and return the respective value of the referred event,
as illustrated in the following example.
Besides the computation of custom process KPIs, like the throughput time between certain activities, SOURCE
and TARGET
also enable more advanced use cases, like the Segregation of Duties.
For the first event in the Activities table, SOURCE
returns the activity name 'A' of
the current event, while TARGET
returns the activity name 'B' of the following event
(refer to row 1 in the example result). For the second event of the input table, SOURCE
returns 'B'
and TARGET
returns 'C' (row 2 in the example result) while they return 'C' and 'D' for
the third event (row 3 in the example result). For case 2, the edge 'B' to 'D' is returned (row 4 in the example result).
The example also demonstrates how the SOURCE
and TARGET
operators can be used to compute the throughput time.
Instead of the activity column, we can use the column containing the timestamp of the events as input.
Consequently, SOURCE
and TARGET
return the timestamps of the referred events. Then, we can pass the result
columns of the SOURCE
and TARGET
operators to the MINUTES_BETWEEN operator to compute the difference between the timestamps of an event and its following event in minutes.
In this example, this results in throughput times of 1 minute from 'A' to 'B', 6 minutes from 'B' to 'C',
and 2 minutes from 'C' to 'D' in the first case as well as 3 minutes from 'B' to 'D' in the second case:
Column1
"CASES"."CASE_ID"
Column2
SOURCE ( "ACTIVITIES"."ACTIVITY" )
Column3
TARGET ( "ACTIVITIES"."ACTIVITY" )
Column4
MINUTES_BETWEEN ( SOURCE ( "ACTIVITIES"."TIMESTAMP" ) , TARGET ( "ACTIVITIES"."TIMESTAMP" ) )
NULL handling
Rows from activity_table.column, which are NULL, are also NULL in the output column. The SOURCE / TARGET operators ignore rows in which the value of the filter column are NULL.
Filter Column
To skip certain events, the SOURCE
and TARGET
operators accept an optional filter column as a parameter.
This column must be of the same size as the activity table. The SOURCE
and TARGET
operators ignore all events that
have a NULL value in the related entry of the filter column. Usually, the filter column is created using the
REMAP_VALUES operator.
SOURCE
operator as filter column. As both activities 'B' and 'C' are mapped to NULL, the next subsequent
activity of 'A' is 'D' in case 1, with a throughput time of 9 minutes. Case 2 only consists of a single activity ('D')
after filtering. Because SOURCE
and TARGET
always refer to two separate events, no edge is generated and case 2 is not
contained in the result at all:
Column1
"CASES"."CASE_ID"
Column2
SOURCE ( "ACTIVITIES"."ACTIVITY" , REMAP_VALUES ( "ACTIVITIES"."ACTIVITY" , [ 'B' , NULL ] , [ 'C' , NULL ] ) )
Column3
TARGET ( "ACTIVITIES"."ACTIVITY" )
Column4
MINUTES_BETWEEN ( SOURCE ( "ACTIVITIES"."TIMESTAMP" ) , TARGET ( "ACTIVITIES"."TIMESTAMP" ) )
Edge Configurations
To define which relationships between the events should be considered, the operators offer the optional edge configuration parameter. The following edge configuration options are available:
(a) ANY_OCCURRENCE[] TO ANY_OCCURRENCE[] (DEFAULT)
(b) FIRST_OCCURRENCE[] TO ANY_OCCURRENCE[]
(c) FIRST_OCCURRENCE[] TO ANY_OCCURRENCE_WITH_SELF[]
(d) ANY_OCCURRENCE[] TO LAST_OCCURRENCE[]
(e) FIRST_OCCURRENCE[] TO LAST_OCCURRENCE[]
The first option (a) is the default and only considers the direct follow relationships between the events, while option (b) only considers relationships from the first event to all subsequent events. Option (c) is similar to option (b) but also considers self-loops of the first event. Option (d) is the opposite of option (b) and only considers relationships going from any event to the last event. Finally, option (e) only considers the relationship between the first and the last event. The different options enable the user to compute KPIs between different activities of the process. For example, you can use option (b) to compute how many minutes after the start of the process (indicated by the first activity 'A') an activity was executed. This is illustrated in the following example:
SOURCE
always refers to the first event of a case (activity 'A' in case 1 and activity 'B' in case 2)
while TARGET
refers to any other event (activities 'B', 'C', and 'D' in case 1 and activity 'D' in case 2).
Consequently, MINUTES_BETWEEN computes the minutes elapsed between the occurrence of 'A' (or 'B' in case 2) and all the other activities of the
corresponding case. For computing the remaining process execution time for each activity of the process, you can
simply adapt the edge configuration to option (d):
Column1
"CASES"."CASE_ID"
Column2
SOURCE ( "ACTIVITIES"."ACTIVITY" , FIRST_OCCURRENCE [ ] TO ANY_OCCURRENCE [ ] )
Column3
TARGET ( "ACTIVITIES"."ACTIVITY" )
Column4
MINUTES_BETWEEN ( SOURCE ( "ACTIVITIES"."TIMESTAMP" ) , TARGET ( "ACTIVITIES"."TIMESTAMP" ) )
Configuration Propagation
To simplify the query, the optional edge configuration and the filter column need to be defined in only one occurrence
of SOURCE
or TARGET
per query. The settings are implicitly propagated to all other operators
in the same query. This can be seen in the examples above, where all SOURCE
and TARGET
occurrences
inherit the filter column or the edge configuration from the first SOURCE
operator of the query.
Join behavior
SOURCE
and TARGET
generate temporary edge tables. A separate edge table is created for each unique edge and filter
column configuration inside a query. SOURCE
and TARGET
operators with identical edge configuration and filter column
add columns to the same edge table. The edge tables are joined to the corresponding case table.
Example
In this example several calls to the SOURCE
and TARGET
operators are done. After each call the status of the edge tables is shown.
SOURCE ( Activity, ANY_OCCURRENCE[] TO ANY_OCCURRENCE[] )
SOURCE
is called with theANY_OCCURRENCE[] TO ANY_OCCURRENCE[]
edge configuration. A temporary edge table is created and theSOURCE
column is added to it:TARGET ( Activity, ANY_OCCURRENCE[] TO ANY_OCCURRENCE[] )
TARGET
is also called with theANY_OCCURRENCE[] TO ANY_OCCURRENCE[]
edge configuration. As this is the same configuration as in the previousSOURCE
call, theTARGET
column is added to the existing edge table:- This column is added to a new temporary table, because the configuration is different to the previous ones:
SOURCE ( Activity, Filter, ANY_OCCURRENCE[] TO ANY_OCCURRENCE[] )
SOURCE
is called with a filter column configuration. TheFilter
could be a call to REMAP_VALUES as shown above. Although the edge configurationANY_OCCURRENCE[] TO ANY_OCCURRENCE[]
is the same as in the calls before, the overall configuration differs because of the additional filter column. Therefore, a new temporary edge table is created, and theSOURCE
column is added to it:
Multiple temporary edge tables can lead to problems if used within the same query. This happens if SOURCE
and TARGET
operators
with different edge and filter configurations are used within the same query. The edge tables cannot be joined
together by default because they do not have a common parent. This behavior is described in
Join functionality in more detail.
The only way of having multiple different configurations inside the same query is to aggregate values of the temporary edge tables to the case table using PU functions, as shown in the following example:
SOURCE
and TARGET
timestamps
are computed first for each case. Then, the average throughput time value between activities per case is pulled to the
case table using PU_AVG.
When using multiple SOURCE
/ TARGET
configurations inside the same query, the automatic configuration propagation
does not take place, meaning that the desired configuration has to be passed to every SOURCE
and TARGET
occurrence:
Column1
"CASES"."CASE_ID"
Column2
PU_AVG ( "Cases" , MINUTES_BETWEEN ( SOURCE ( "ACTIVITIES"."TIMESTAMP" , REMAP_VALUES ( "ACTIVITIES"."ACTIVITY" , [ 'A' , NULL ] ) ) , TARGET ( "ACTIVITIES"."TIMESTAMP" , REMAP_VALUES ( "ACTIVITIES"."ACTIVITY" , [ 'A' , NULL ] ) ) ) )
Column3
PU_AVG ( "Cases" , MINUTES_BETWEEN ( SOURCE ( "ACTIVITIES"."TIMESTAMP" , REMAP_VALUES ( "ACTIVITIES"."ACTIVITY" , [ 'B' , NULL ] ) ) , TARGET ( "ACTIVITIES"."TIMESTAMP" , REMAP_VALUES ( "ACTIVITIES"."ACTIVITY" , [ 'B' , NULL ] ) ) ) )
Filter Behavior
As described above, the temporary edge tables are joined to the case table. This means that if a case is filtered out in the case table, all activities related to that case in the temporary edge tables are also filtered out. This also holds vice versa: if all activities related to a case are filtered out in the edge table, the case itself is also filtered out in the case table. If not all edges of a case are filtered out, the case stays in the case table. Filter Propagation describes the filter propagation behavior in more detail.
SOURCE
activity
equals 'A'. Since case 2 does not contain a SOURCE
activity 'A', it is filtered out completely by the filter. Case 1
however contains an edge with an 'A' as the SOURCE
activity (edge 'A' to 'B'), which is why case 1 appears in the
result:
Filter
FILTER SOURCE ( "ACTIVITIES"."ACTIVITY" ) = 'A';
Column1
"CASES"."CASE_ID"