Skip to main content

CPM46

Delta load

A delta load can significantly reduce the amount of data that has to be fetched from a source DB during the data model load.

To set up a delta load, Celonis 4 allows partitioning of the input tables. Having split the input tables into partitions, Celonis 4 then only fetches those partitions from the DB that it identified as changed since the last load. For identifying if the data inside partitions changed, Celonis 4 currently provides two strategies: "Timestamp Delta Load" and "Checksum Delta Load". For each table in the data model, the activation and selection of the partitioning strategy can be done in the table configuration of the data model editor:

The "Delta Load table" checkbox is available if the table is loaded from a database. As soon as it is checked, the "Delta Load" tab in the menu bar on the top appears:

34439912.png

In the Delta Load configuration tab, one of the available strategies can be chosen:

34439911.png
Timestamp delta load

This strategy partitions a table along a timestamp column. When a data model load is triggered in Celonis 4, this strategy checks for each partition if the number of entries, the maximum value or both changed. The retrieved values are called checksums and compared to the values from the previous load. If a checksum changes, the corresponding partition is reloaded from the source database.

Pro

Con

+ Easy to configure

  • Might miss updates

+ Applicable in many cases

  • Checksums can be expensive to calculate on source DB

Example

In the example below, the column "Eventtime" is used as the timestamp delta load column. Here, the partitioning is based on month and we use the count of timestamps for the checksum.

First load

We first load this table from the source DB.

34439910.png

During the first data model load, checksum labels are created in the background (not visible for the user). The checksum is here the count of rows per month.

34439909.png
Second load

Now the input data on the source DB changed with an additional entry.

34439908.png

When we trigger the data model load again, the checksums in the background (not visible for the user) are recalculated. The count per month changes for the month October from 2 to 3. The checksum for November stays the same as no value for November was added.

34439907.png

Now, only the data of the table that is in the month October is updated in Celonis, the data for which the Eventtime is in November is not updated. Even if the activity name in November changed in the source DB from "C" to "F", this value would not be updated with the defined Delta Load because the row count did not change.

Checksum delta load

To be more flexible, the checksum strategy allows you to define yourself which partitions are reloaded. The strategy relies on a separate checksum table. In this table you define the partitions and its checksums.

If you want a partition from the source table to be reloaded during the data model load, you only have to change the checksum value from the partition table. A data model reload will only load those partitions for which the checksum value of the partition key changed. The exact value of the checksum is not important.

Example

In this example, the source table is partitioned by country code:

Source table

Customer

City

Country

Customer A

Bristol

GB

Customer B

Paris

FR

Customer C

London

GB

Partition table

Partition key

Checksum

GB

1

FR

1

If in this example the checksum for FR is set to 2 and a reload from source is triggered, all customers from France are reloaded. The customers from Great Britain will not be updated.

The partition table is not managed by Celonis 4 and has to be maintained by the user. Also it is noteworthy that partitions which are not mentioned in the partition table are not fetched from the source database and are therefore not available in Celonis 4. In the example above, customers from Spain would not be loaded because there is no partition key ES in the partition table.

Pro

Con

+ Very flexible

  • User needs to implement logic to maintain partition table

Queries sent to the source database

In Celonis 4.6, two queries are executed against the source database for loading a table. The first query is used to identify the partitions which have changed. The second query is used to load the changed partitions from the database.

Identify changed partitions

The query to identify the changed partitions depends on the chosen strategy.

Checksum:

SELECT $CHECKSUM_TABLE.$CHECKSUM_COLUMN, $CHECKSUM_TABLE.$KEY_COLUMN FROM $CHECKSUM_TABLE

The query fetches the KEY and the CHECKSUM column from the CHECKSUM_TABLE. It checks for each partition/KEY if the checksum has changed. To be able to perform this check, the checksums of the last load are stored on disk.

Timestamp:

SELECT <ROUNDED_TIMESTAMP>, <CHECKSUM_AGGREGATOR> FROM $SOURCE_TABLE GROUP BY <ROUNDED_TIMESTAMP>
<ROUNDED_TIMESTAMP> = [EXTRACT(MINUTE FROM $TIMESTAMP_COLUMN) | EXTRACT(MINUTE FROM $TIMESTAMP_COLUMN) || EXTRACT(HOUR FROM $TIMESTAMP_COLUMN) | EXTRACT(MINUTE FROM $TIMESTAMP_COLUMN) || EXTRACT(HOUR FROM $TIMESTAMP_COLUMN) || EXTRACT(DAY FROM $TIMESTAMP:COLUMN) | ...] // This is database specific
<CHECKSUM_AGGREGATOR> = [COUNT($TIMESTAMP_COLUMN) | MAX($TIMESTAMP_COLUMN) | COUNT($TIMESTAMP_COLUMN) || MAX($TIMESTAMP_COLUMN)]

This query groups the SOURCE_TABLE by rounding the timestamps. It calculates a checksum by using MAX , COUNT, or both combined. The part for rounding the timestamps is database specific because there is no way which works on all databases. The shown example uses the syntax which works on most databases, but there is for example a special syntax for SAP HANA in place.

An example for such a query is:

SELECT EXTRACT(MINUTE FROM Activity_Table.Timestamp) || EXTRACT(HOUR FROM Activity_Table.Timestamp), COUNT(Activity_Table.Timestamp) FROM Activity_Table GROUP BY EXTRACT(MINUTE FROM Activity_Table.Timestamp) || EXTRACT(HOUR FROM Activity_Table.Timestamp)
Load changed partitions

After the changed partitions are identified, the delta load executes a query to load them. All changed partitions are loaded in by one query:

SELECT $SOURCE_TABLE.$SOURCE_COLUMN1, $SOURCE_TABLE.$SOURCE_COLUMN2, ... FROM $SOURCE_TABLE WHERE $SOURCE_TABLE.$KEY_COLUMN IN($PARTITION_KEY1, $PARTITION_KEY2, ...) ORDER BY $SOURCE_TABLE.$KEY_COLUMN

The load is filtered down to the changed partitions by the IN clause. The result needs to be sorted so that Celonis 4 can process it.

Comparison against standard load

A standard load executes a single query, which looks like this:

SELECT $SOURCE_TABLE.$SOURCE_COLUMN1, $SOURCE_TABLE.$SOURCE_COLUMN2, ... FROM $SOURCE_TABLE WHERE $SOURCE_TABLE.$KEY_COLUMN

This query is simpler and also no additional query to identify the changed partitions is necessary. It is crucial for the performance of the delta load that the query optimizer of the source database is able to create a query plan which efficiently makes use of the IN where clause. If that is not the case, the delta load can take longer then a standard load even though less data is loaded from the source database. Especially for complex views, it can be hard for the query optimizer to make the right decisions. Therefore in case of a bad performance of the delta load, we advise consulting your database expert.