Description
CLUSTER_STRINGS clusters strings in a column based on their similarity.
CLUSTER_STRINGS clusters strings in a given column that are similar according to the so-called edit distance (also sometimes known as Levenshtein distance) and a given threshold. The result of the operator is a column where each string is replaced by a representative of the assigned cluster.
The edit distance is the minimum number of edit operations that is required to transform one string into another by inserting, deleting or replacing a character. E.g., it takes 1 operation to transform "is" to "his" (insertion of "h"), 1 operation to transform "bad" to "bar" (replacing "d" by "r"), and 2 operations to transform "hteir" to "their" (delete "h", insert "h" after "t").
The representative of the cluster is the most frequent string in the cluster. If there is no unique candidate, the candidate that sorts first according to lexicographical order is chosen.
The weight of characters can be changed, e.g., all numerals can be mapped to a heigher weight. The main purpose of this parameter is to allow giving numbers a higher trust: The items "Paper 90g" and "Paper 80g" are likely to actually mean different types of paper, so these two strings should be in different clusters. However, the two strings "Paper 90g" and "Papre 90g" most likely differ just because of a typo in the latter, so they should be in the same cluster.
Clustering is transitive: If the two stringss1
and s2
as well as the two strings
s2
and s3
are found to be similar, then s1
, s2
, and
s3
will end up in the same cluster, although s1
and s3
may not be similar.
Syntax
CLUSTER_STRINGS ( table.column , EDIT_THRESHOLD ( edit_distance ) [, WEIGHT ( weighted_tokens , weight ) ] [, PARTITION BY ( partition_column, ...)])
- column: String column.
- edit_distance: The edit distance.
- weighted_tokens: String of tokens for which an edit operation should have a user defined cost.
- weight: Cost for these tokens.
- partition_column: Optional partition column to specify groups in which
CLUSTER_STRINGS
should search for similar strings.
NULL Handling
CLUSTER_STRINGS ignores NULL values. If the column contains only NULL values the result will also be a column containing only NULL values. If there are NULL values and non NULL values, it is guaranteed that no non NULL value is similar to the NULL value.
Examples
CLUSTER_STRINGS
example, using an edit threshold of 2. All input strings are put into the same cluster.
The cluster representative is "chocolate" because it is the most frequent input string:
Column1
CLUSTER_STRINGS ( "TABLE"."NAME" , EDIT_THRESHOLD ( 2 ) )
CLUSTER_STRINGS
example, using an edit threshold of 4. All input strings are put into the same cluster.
The cluster representative is "Harry Potter" because it sorts first according to lexicographical order:
Column1
CLUSTER_STRINGS ( "TABLE"."BOOKS" , EDIT_THRESHOLD ( 4 ) )
Column1
CLUSTER_STRINGS ( "TABLE"."PRODUCT" , EDIT_THRESHOLD ( 5 ) , WEIGHT ( '0123456789' , 10 ) )
Partitioning
The partition columns specify groups. The CLUSTER_STRINGS
function operates independently within
every group. Similar groups are not considered to be the same.
CLUSTER_STRINGS
clusters the strings in each group individually.
In the first group the cluster representative is "Pasta" since it is the first according to lexicographical
order. In the second group the cluster representative is "Pizza" since it is the only element to appear twice.
Column1
CLUSTER_STRINGS ( "TABLE"."NAME" , EDIT_THRESHOLD ( 2 ) , PARTITION BY ( "TABLE"."PARTITION" ) )
CLUSTER_STRINGS
on a joined table.
Column1
CLUSTER_STRINGS ( "Table1"."Item" , EDIT_THRESHOLD ( 2 ) , PARTITION BY ( "Partition"."Group" ) )