CLUSTER_STRINGS

Description

CLUSTER_STRINGS clusters strings in a column based on their similarity.

CLUSTER_STRINGS clusters strings in a given column that are similar according to the so-called edit distance (also sometimes known as Levenshtein distance) and a given threshold. The result of the operator is a column where each string is replaced by a representative of the assigned cluster.

The edit distance is the minimum number of edit operations that is required to transform one string into another by inserting, deleting or replacing a character. E.g., it takes 1 operation to transform "is" to "his" (insertion of "h"), 1 operation to transform "bad" to "bar" (replacing "d" by "r"), and 2 operations to transform "hteir" to "their" (delete "h", insert "h" after "t").

The representative of the cluster is the most frequent string in the cluster. If there is no unique candidate, the candidate that sorts first according to lexicographical order is chosen.

The weight of characters can be changed, e.g., all numerals can be mapped to a heigher weight. The main purpose of this parameter is to allow giving numbers a higher trust: The items "Paper 90g" and "Paper 80g" are likely to actually mean different types of paper, so these two strings should be in different clusters. However, the two strings "Paper 90g" and "Papre 90g" most likely differ just because of a typo in the latter, so they should be in the same cluster.

Clustering is transitive: If the two strings s1 and s2 as well as the two strings s2 and s3 are found to be similar, then s1, s2, and s3 will end up in the same cluster, although s1 and s3 may not be similar.

Syntax

CLUSTER_STRINGS ( table.column , EDIT_THRESHOLD ( edit_distance )
[, WEIGHT ( weighted_tokens , weight ) ]
[, PARTITION BY ( partition_column, ...)])

column: String column.
edit_distance: The edit distance.
weighted_tokens: String of tokens for which an edit operation should have a user defined cost.
weight: Cost for these tokens.
partition_column: Optional partition column to specify groups in which CLUSTER_STRINGS should search for similar strings.

NULL handling

CLUSTER_STRINGS ignores NULL values. If the column contains only NULL values the result will also be a column containing only NULL values. If there are NULL values and non NULL values, it is guaranteed that no non NULL value is similar to the NULL value.

Examples

[1] Simple CLUSTER_STRINGS example, using an edit threshold of 2. All input strings are put into the same cluster. The cluster representative is "chocolate" because it is the most frequent input string

Query

Column1

CLUSTER_STRINGS ( "TABLE"."NAME" , EDIT_THRESHOLD ( 2 ) )

Input

Output

TABLE

NAME : STRING
'chocolate'
'cocolate'
'cholate'
'chocolate'
'chocolaet'

Result

Column1 : STRING
'chocolate'
'chocolate'
'chocolate'
'chocolate'
'chocolate'

[2] Simple CLUSTER_STRINGS example, using an edit threshold of 4. All input strings are put into the same cluster. The cluster representative is "Harry Potter" because it sorts first according to lexicographical order.

Query

Column1

CLUSTER_STRINGS ( "TABLE"."BOOKS" , EDIT_THRESHOLD ( 4 ) )

Input

Output

TABLE

BOOKS : STRING
'Terry Otter'
'Harry Spotter'
'Harry Potter'
'Larry Squatter'

Result

Column1 : STRING
'Harry Potter'
'Harry Potter'
'Harry Potter'
'Harry Potter'

[3] In this example, the edit threshold is set to 5, and the weight for performing an edit to any number character is set to 10. By setting this weight, "FLUX DIFFUSER 9000" and "FLUX DIFFUSER 3000" are not placed inside the same cluster because changing the 9 into a 3 (or vice versa) would cost 10, but only 5 edits are allowed:.

Query

Column1

CLUSTER_STRINGS ( "TABLE"."PRODUCT" , EDIT_THRESHOLD ( 5 ) , WEIGHT ( '0123456789' , 10 ) )

Input

Output

TABLE

PRODUCT : STRING
'FLUX DIFUSER 9000'
'FLUX DIFFUSER 9000'
'FLUX DIFFUSER 3000'

Result

Column1 : STRING
'FLUX DIFFUSER 9000'
'FLUX DIFFUSER 9000'
'FLUX DIFFUSER 3000'

Partitioning

The partition columns specify groups. The CLUSTER_STRINGS function operates independently within every group. Similar groups are not considered to be the same.

[4] With a given partition column CLUSTER_STRINGS clusters the strings in each group individually. In the first group the cluster representative is "Pasta" since it is the first according to lexicographical order. In the second group the cluster representative is "Pizza" since it is the only element to appear twice.

Query

Column1

CLUSTER_STRINGS ( "TABLE"."NAME" , EDIT_THRESHOLD ( 2 ) , PARTITION BY ( "TABLE"."PARTITION" ) )

Input

Output

TABLE

NAME : STRING	PARTITION : INT
'Pasta'	1
'Pista'	1
'Pizta'	1
'Pizza'	1
'Pizza'	2
'Pisza'	2
'Pizza'	2
'Pitza'	2
'Bizza'	2

Result

Column1 : STRING
'Pasta'
'Pasta'
'Pasta'
'Pasta'
'Pizza'
'Pizza'
'Pizza'
'Pizza'
'Pizza'

[5] CLUSTER_STRINGS on a joined table.

Query

Column1

CLUSTER_STRINGS ( "Table1"."Item" , EDIT_THRESHOLD ( 2 ) , PARTITION BY ( "Partition"."Group" ) )

Input

Output

Partition

key : INT	Group : STRING
1	'A'
2	'A'
3	'B'
4	'B'
5	'A'
6	'A'
7	'B'
8	'C'

Table1

key : INT	Item : STRING
1	'AA'
2	'AB'
3	'AC'
4	'AD'
5	'CD'

Foreign Keys

Partition.key

Table1.key

Result

Column1 : STRING
'AA'
'AA'
'AC'
'AC'
'CD'

PQL47 (PQL Function Library - CPM 4.7)

CLUSTER_STRINGS

Description

Syntax

NULL handling

Examples

Partitioning

Search results