CLUSTER_STRINGS clusters strings in a column based on their similarity.
CLUSTER_STRINGS clusters strings in a given column that are similar according to the so-called edit distance (also sometimes known as Levenshtein distance) and a given threshold. The result of the operator is a column where each string is replaced by a representative of the assigned cluster.
The edit distance is the minimum number of edit operations that is required to transform one string into another by inserting, deleting or replacing a character. E.g., it takes 1 operation to transform "is" to "his" (insertion of "h"), 1 operation to transform "bad" to "bar" (replacing "d" by "r"), and 2 operations to transform "hteir" to "their" (delete "h", insert "h" after "t").
The representative of the cluster is the most frequent string in the cluster. If there is no unique candidate, the candidate that sorts first according to lexicographical order is chosen.
The weight of characters can be changed, e.g., all numerals can be mapped to a heigher weight. The main purpose of this parameter is to allow giving numbers a higher trust: The items "Paper 90g" and "Paper 80g" are likely to actually mean different types of paper, so these two strings should be in different clusters. However, the two strings "Paper 90g" and "Papre 90g" most likely differ just because of a typo in the latter, so they should be in the same cluster.Clustering is transitive: If the two strings
s2 as well as the two strings
s3 are found to be similar, then
s3 will end up in the same cluster, although
s3 may not be similar.
- column: String column.
- edit_distance: The edit distance.
- weighted_tokens: String of tokens for which an edit operation should have a user defined cost.
- weight: Cost for these tokens.
- partition_column: Optional partition column to specify groups in which
CLUSTER_STRINGSshould search for similar strings.
CLUSTER_STRINGS ignores NULL values. If the column contains only NULL values the result will also be a column containing only NULL values. If there are NULL values and non NULL values, it is guaranteed that no non NULL value is similar to the NULL value.
The partition columns specify groups. The
CLUSTER_STRINGS function operates independently within
every group. Similar groups are not considered to be the same.