Applies to: CELONIS 4.3, CELONIS 4.4, CELONIS 4.5, CELONIS 4.6, CELONIS 4.7

Description

Create a decision tree, learn classification rules based on training data, and use it to classify new data rows.

Like other machine learning functions (such as KMEANS or LINEAR_REGRESSION), the DECISION_TREE function consists of two steps.

The first step consists of training the decision tree model. This learns the classification rules of the decision tree (see the Wikipedia article on decision trees) based on the provided training data. In this step, the user can specify which columns the model should use and on which subset of the data it should be trained. Additionally, a max_depth parameter can be provided to avoid overfitting to the training data.

In the second step, once the model is trained, it can be used to classify new rows based on the model's input columns.

Syntax

DECISION_TREE ( TRAIN_[FILTERED_]DT ( [ EXCLUDED ( table.exclude_column, ... ), ] INPUT ( table.input_column, ... ), OUTPUT ( table.output_column ) [, max_depth ] ), PREDICT ( table.predict_column, ... ) ) 
  • TRAIN_DT: Constructs the decision tree while ignoring any given filters. Authorization objects are still respected.
  • TRAIN_FILTERED_DT: Constructs the decision tree, filtering the training data with the current filters.
  • EXCLUDED: One or more dimension columns (columns that are not aggregations) that are used as grouping columns for the INPUT columns of the model training (see the sketch below). These columns are ignored by the training algorithm and are not part of the trained model, unless they are also specified in INPUT.
  • INPUT: One or more columns that are used to train the model.
  • OUTPUT: One column containing the label associated with each row that the model should train on.
  • max_depth: Specifies the maximum depth allowed for the learned tree; this can be used to prevent overfitting. The default value is max_depth=0, which means no limit is set on the depth of the tree.
  • PREDICT: One or more columns to which the rules of the trained decision tree are applied. The number of PREDICT columns must match the number of INPUT columns, and the columns should be semantically the same as the ones provided in INPUT for the results to make sense. Any mix of FLOAT and INT columns in INPUT and PREDICT returns meaningful results; other mixes of data types are accepted by the DECISION_TREE function, but their results should not be used.

All columns in TRAIN_DT have to be joinable. The columns in PREDICT do not have to be joinable with the columns in TRAIN_DT.

The input of the model training is regarded as an independent sub-query. This means that if an aggregation is used, it is independent of the dimensions defined in the rest of the query. It also means that the columns within TRAIN_DT have to be joinable with each other, but not with the columns used in the rest of the query.
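As an illustration of the EXCLUDED grouping described above, here is a minimal sketch, assuming hypothetical "Cases" and "Events" tables in which each case carries one label and a precomputed total cost (none of these names come from the function itself):

DECISION_TREE (
    TRAIN_DT (
        EXCLUDED ( "Cases"."CaseId" ) ,
        INPUT ( SUM ( "Events"."Cost" ) ) ,
        OUTPUT ( "Cases"."Label" )
    ) ,
    PREDICT ( "Cases"."TotalCost" )
)

In this sketch, the SUM aggregation is evaluated per "Cases"."CaseId" group within the independent training sub-query, while "Cases"."CaseId" itself is not used as a model feature.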

NULL handling

  • If a training row contains a NULL value, the value is ignored and does not affect the model.
  • If a PREDICT row contains a NULL value, the result for that row will be NULL (see the sketch below).
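If NULL results are not desired, the NULL values can be replaced before prediction. A minimal sketch using COALESCE, assuming that PREDICT accepts expressions in the same way INPUT does; the fallback value 0 is an arbitrary placeholder, and whether any replacement value is sensible depends on the data:

DECISION_TREE (
    TRAIN_DT ( INPUT ( "Train"."Data" ) , OUTPUT ( "Train"."Label" ) ) ,
    PREDICT ( COALESCE ( "Predict"."Data" , 0 ) )
)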

Filter behavior

Standard Decision Tree

If rows of a column are filtered, this does not affect the decision tree, as long as the model is not trained on aggregation results. In other words, the underlying model stays the same, independent of filters and selections. If some of the input data of a model should be restricted, a CASE WHEN statement can be used to map the values that should be ignored to NULL (see the sketch below).
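For example, to train only on data points below 30 without relying on filters, the out-of-range values can be mapped to NULL, so they are ignored per the NULL handling above. A sketch using the "Train" and "Predict" tables from the examples below:

DECISION_TREE (
    TRAIN_DT (
        INPUT ( CASE WHEN "Train"."Data" < 30 THEN "Train"."Data" ELSE NULL END ) ,
        OUTPUT ( "Train"."Label" )
    ) ,
    PREDICT ( "Predict"."Data" )
)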

If a model is trained on the results of an aggregation, it still changes with the filtering, because the result of the aggregation is affected by the filters.

Filtered Decision Tree

If a filter or selection changes, the model is retrained and the resulting function adapts to the new view of the data. This may have a large impact on performance.

Result

DECISION_TREE splitting rules are constructed to reduce the entropy within subsets of the data created at each splitting point.
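For background, and not as a specification of this implementation: a standard way to express this criterion is the Shannon entropy of a subset $S$ with class proportions $p_1, \ldots, p_k$, together with the information gain of a candidate split of $S$ into subsets $S_1, \ldots, S_m$:

H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i

\mathrm{Gain} = H(S) - \sum_{j=1}^{m} \frac{|S_j|}{|S|} \, H(S_j)

A splitting point that maximizes the gain reduces the entropy within the created subsets the most.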

Examples


[1]

In this example two unconnected tables are used. The "Train" table provides the data to train the decision tree model, including the output label for each training row. The data in the "Predict" table is then processed by the model, and each row is assigned a label.

There is no max_depth configured for this example, so the depth of the tree is unbounded. In this case the result is a decision tree of depth 7:


Query
Column1
"Predict"."Data"
Column2
DECISION_TREE (
    TRAIN_DT ( INPUT ( "Train"."Data" ) , OUTPUT ( "Train"."Label" ) ) ,
    PREDICT (
        "Predict"."Data"
    )
)
Input
Predict
Data : INT
-13
-6
1
8
15
22
29
36
43
50
Train
Data : INT    Label : STRING
-15           'Very Bad'
10            'Bad'
20            'Good'
21            'Excellent'
24            'Good'
35            'Bad'
45            'Very Bad'
Output
Result
Column1 : INT    Column2 : STRING
-13              'Very Bad'
-6               'Very Bad'
1                'Bad'
8                'Bad'
15               'Good'
22               'Excellent'
29               'Good'
36               'Bad'
43               'Very Bad'
50               'Very Bad'



[2]

In this example a FILTER is applied to the training input data, meaning only data points with a value below 30 are used in the model training. This restriction changes the predictions. Note that the current and any given filters are only applied in the DECISION_TREE query when TRAIN_FILTERED_DT is used.

The FILTER restriction results in a decision tree of depth 4:


Query
Filter
FILTER "Train"."Data" < 30;
Column1
"Predict"."Data"
Column2
DECISION_TREE (
    TRAIN_FILTERED_DT ( INPUT ( "Train"."Data" ) , OUTPUT ( "Train"."Label" ) ) ,
    PREDICT (
        "Predict"."Data"
    )
)
Input
Predict
Data : INT
-13
-6
1
8
15
22
29
36
43
50
Train
Data : INT    Label : STRING
-15           'Very Bad'
10            'Bad'
20            'Good'
21            'Excellent'
24            'Good'
35            'Bad'
45            'Very Bad'
Output
Result
Column1 : INT    Column2 : STRING
-13              'Very Bad'
-6               'Very Bad'
1                'Bad'
8                'Bad'
15               'Good'
22               'Excellent'
29               'Good'
36               'Good'
43               'Good'
50               'Good'



[3]

In this example a non-default value for max_depth is used in the model training. The value 3 is passed in the query, so the learned tree is at most three levels deep; in the default case, there is no limit on the depth. Limiting the depth results in this decision tree:


Query
Column1
"Predict"."Data"
Column2
DECISION_TREE (
    TRAIN_DT ( INPUT ( "Train"."Data" ) , OUTPUT ( "Train"."Label" ) , 3 ) ,
    PREDICT (
        "Predict"."Data"
    )
)
Input
Predict
Data : INT
-13
-6
1
8
15
22
29
36
43
50
Train
Data : INT    Label : STRING
-15           'Very Bad'
10            'Bad'
20            'Good'
21            'Excellent'
24            'Good'
35            'Bad'
45            'Very Bad'
Output
Result
Column1 : INT    Column2 : STRING
-13              'Very Bad'
-6               'Very Bad'
1                'Bad'
8                'Bad'
15               'Bad'
22               'Bad'
29               'Bad'
36               'Bad'
43               'Very Bad'
50               'Very Bad'

