CLUSTER_STRINGS
Description
CLUSTER_STRINGS clusters strings in a column based on their similarity.
CLUSTER_STRINGS clusters strings in a given column that are similar according to the so-called edit distance (also sometimes known as Levenshtein distance) and a given threshold. The result of the operator is a column where each string is replaced by a representative of the assigned cluster.
The edit distance is the minimum number of edit operations that is required to transform one string into another by inserting, deleting or replacing a character. E.g., it takes 1 operation to transform "is" to "his" (insertion of "h"), 1 operation to transform "bad" to "bar" (replacing "d" by "r"), and 2 operations to transform "hteir" to "their" (delete "h", insert "h" after "t").
The representative of the cluster is the most frequent string in the cluster. If there is no unique candidate, the candidate that sorts first according to lexicographical order is chosen.
The weight of characters can be changed, e.g., all numerals can be mapped to a heigher weight. The main purpose of this parameter is to allow giving numbers a higher trust: The items "Paper 90g" and "Paper 80g" are likely to actually mean different types of paper, so these two strings should be in different clusters. However, the two strings "Paper 90g" and "Papre 90g" most likely differ just because of a typo in the latter, so they should be in the same cluster.
Clustering is transitive: If the two strings s1
and s2
as well as the two strings s2
and s3
are found to be similar, then s1
, s2
, and s3
will end up in the same cluster, although s1
and s3
may not be similar.
Syntax
CLUSTER_STRINGS ( table.column , EDIT_THRESHOLD ( edit_distance ) [, WEIGHT ( weighted_tokens , weight ) ] [, PARTITION BY ( partition_column, ...)])
column: String column.
edit_distance: The edit distance.
weighted_tokens: String of tokens for which an edit operation should have a user defined cost.
weight: Cost for these tokens.
partition_column: Optional partition column to specify groups in which
CLUSTER_STRINGS
should search for similar strings.
NULL handling
CLUSTER_STRINGS ignores NULL values. If the column contains only NULL values the result will also be a column containing only NULL values. If there are NULL values and non NULL values, it is guaranteed that no non NULL value is similar to the NULL value.
Examples
[1] Simple |
[2] Simple |
[3] In this example, the edit threshold is set to 5, and the weight for performing an edit to any number character is set to 10. By setting this weight, "FLUX DIFFUSER 9000" and "FLUX DIFFUSER 3000" are not placed inside the same cluster because changing the 9 into a 3 (or vice versa) would cost 10, but only 5 edits are allowed:. |
Partitioning
The partition columns specify groups. The CLUSTER_STRINGS
function operates independently within every group. Similar groups are not considered to be the same.
[4] With a given partition column
|
[5]
|