Skip to main content

PQL46 (PQL Function Library - CPM 4.6)

CLUSTER_VARIANTS

Applies to: CELONIS 4.5 CELONIS 4.6

Description

The CLUSTER_VARIANTS operator groups similar process variants (or traces) into clusters. Variants which cannot be assigned to a cluster are marked as noise.

Warning

Operator Performance

This operator performs very expensive computations and requires excessive memory and CPU resources. To avoid running out of memory, this operator is currently limited to 100,000 distinct variants.

Computation Times

While executing this operator, users may also experience long computation times and unresponsive analyses as it occupies a vast amount of computation capacity.

Syntax
CLUSTER_VARIANTS ( variant_column, MIN_PTS, EPSILON )
  • variant_column: The column which stores the result of the VARIANToperator.

  • MIN_PTS: INT value giving the minimal density of similar variants that is required to create a cluster. Lower values tend to create more clusters, while higher values tend to classify more variants as noise. An estimate of a well-performing parameter value can be computed by the ESTIMATE_CLUSTER_PARAMS operator.

  • EPSILON: INT value giving the search radius for measuring the variant density. It is quantified by the number of different relations between two subsequent activities in the variants. The value must be an integer in the range [0, 5]. The higher the value, the more it is likely that all variants are assigned to the same cluster. It is recommended to choose a quite low value (e.g. 2). However, the value of 0 requires equality between variants to be clustered.

Result: An INT column in which the number represents the ID of the cluster, to which the case (and its variant respectively) has been assigned to. A cluster ID of -1 indicates that the case has been classified as noise and a cluster ID of -2 indicates that the respective case does not have any activities (i.e. the case is empty).

Example

[1] Variant 1 (trace A, B, C of cases 1, 2, 3, and 4) and variant 3 (trace A, B, B, C of case 7) are similar and grouped into the same cluster (ID 0), while variant 2 (trace A, B, D of cases 5 and 6) forms a cluster on its own (ID 1). The variant 4 (trace X, Y, Z of case 8) is classified as noise due to its missing similarity to other traces and its low frequency (it occurs only once).

Query

Column1

"Cases"."Case"

Column2

VARIANT ( "Activities"."Activity" )

Column3

CLUSTER_VARIANTS ( VARIANT ( "Activities"."Activity" ) , 2 , 2 )

Input

Output

Activities

Case : INT

Activity : STRING

Timestamp : DATE

1

'A'

Fri Jan 01 2016 01:00:00.000

1

'B'

Fri Jan 01 2016 02:00:00.000

1

'C'

Fri Jan 01 2016 03:00:00.000

2

'A'

Fri Jan 01 2016 04:00:00.000

2

'B'

Fri Jan 01 2016 05:00:00.000

2

'C'

Fri Jan 01 2016 06:00:00.000

3

'A'

Fri Jan 01 2016 07:00:00.000

3

'B'

Fri Jan 01 2016 08:00:00.000

3

'C'

Fri Jan 01 2016 09:00:00.000

4

'A'

Fri Jan 01 2016 10:00:00.000

4

'B'

Fri Jan 01 2016 11:00:00.000

4

'C'

Fri Jan 01 2016 12:00:00.000

5

'A'

Fri Jan 01 2016 13:00:00.000

5

'B'

Fri Jan 01 2016 14:00:00.000

5

'D'

Fri Jan 01 2016 15:00:00.000

6

'A'

Fri Jan 01 2016 16:00:00.000

6

'B'

Fri Jan 01 2016 17:00:00.000

6

'D'

Fri Jan 01 2016 18:00:00.000

7

'A'

Fri Jan 01 2016 19:00:00.000

7

'B'

Fri Jan 01 2016 20:00:00.000

7

'B'

Fri Jan 01 2016 21:00:00.000

7

'C'

Fri Jan 01 2016 22:00:00.000

8

'X'

Fri Jan 01 2016 23:00:00.000

8

'Y'

Sat Jan 02 2016 00:00:00.000

8

'Z'

Sat Jan 02 2016 01:00:00.000

Cases

Case : INT

1

2

3

4

5

6

7

8

Foreign Keys

Activities.Case

Cases.Case

Result

Column1 : INT

Column2 : STRING

Column3 : INT

1

'A, B, C'

0

2

'A, B, C'

0

3

'A, B, C'

0

4

'A, B, C'

0

5

'A, B, D'

1

6

'A, B, D'

1

7

'A, B, B, C'

0

8

'X, Y, Z'

-1