Current Activities
Distances in Clustering
Brian T. Luke

For K-Means and Hierarchical Clustering, as well as Diversity Selection, it is necessary to calculate a "distance" between either two objects (one of which may be a cluster seed point) or an object and a group centroid. In this discussion, it is assumed that each object is described by an array of real-valued metrics. A discussion of distance metrics between binary objects is given in Clustering Binary Objects.

Since we need to minimize the largest or average distance, or the variance, during clustering, it is important that each metric contribute equally to the total distance. In other words, if one metric spans the range [0.0,0.5] and another spans [0.0,100.0], the maximum deviation in the first would have little effect on the total distance, while even a modest separation in the second would have a much larger effect. To remove this dependency on the range spanned by each metric, it is important to first standardize the values. This means that each metric, when compared over the full set of objects will have a mean of 0.0 and a variance (or standard deviation) of 1.0. For each metric, the following steps should be taken.

  1. Sum the values of the metric over all objects and divide the sum by the number of objects.
  2. Subtract this average value from the metric in all objects.
  3. Sum the square of these new values over all objects, divide the sum by the total number of objects, and take its square-root. This is the standard deviation of the new values.
  4. Divide the metric by the standard deviation in each object.
This process standardizes each metric describing the objects. In some cases, such as calculating the squared chord distance (described below), it may be necessary to have non-negative values for all metrics. This can be done for each metric by adding the step of finding the minimum value of this metric over all objects, and subtracting this minimum value from the metric in all objects.

If each object is described by a real-valued array of metrics of length K, the ith object Xi has the array {xki} for k=1,2,...,K.

The general form for the distance, which is called the LN norm, between object i and object/centroid j is

LNij = [  |xki - xkj|N ]1/N

For K=2, the region contained within a threshold distance for various values of N is shown below. When N=1, L1ij is known as the city-block or Manhattan distance; while for N=2, L2ij is the Euclidean distance. One can see from the pictures below, as N increases the enclosed region takes on more the shape of a square. For an infinitely large value of N, this would be a perfect square. This Linf norm is called the Chebychev distance.


N = 1

N = 2

N = 3

N = 4

N = 20

Other distance measures that may be used are given in the following table.

Distance Measure Equation
Canberra distance
 D =  [ |xki - xkj| / |xki + xkj| ]
Squared chord distance
Squared Chi-squared distance
 D =  [ (xki - xkj)2 / |xki + xkj| ]

Though it is required for the Squared chord distance, it is highly recommended that the metrics be shifted to non-negative (or even positive) values before calculating any of these distances. Otherwise, it is possible to have infinite Canberra and Squared Chi-squared distances.

Another method of calculating a distance is to first measure the similarity between two objects. This similarity, S, generally takes on values between 0.0 (completely dissimilar) and 1.0 (identical). The distance can then be given by A(1-S)M, where M is a non-negative exponent and A is a constant.

If the objects to be clustered are identified by L descriptors, the object X contains the descriptors {xi} and object Y contains {yi} (for i=1,2,...,L). The cosine similarity function [1] between these objects, CSxy, treats the set of descriptors as components of an M-dimensional vector, and the similarity is the cosine of the angle between these vectors (their dot product divided by their magnitudes). This similarity, which is also known as the Ochini coefficient, is given by the expression

As the objects become more similar, CSxy approaches 1.0 and their distance approaches 0.0. While the cosine of this angle is a similarity measure, the arccosine of this value (which is just the magnitude of the angle) can be used as a distance metric.

The cosine similarity function as well as other similarity and distance metrics can be used to cluster binary objects.

One final point is that it is possible to weight each descriptor. If x'ik is the kth descriptor (k=1,2,...L) for object i, it can be weighted by wk. The terms in the above expressions become

xik = wk x'ik

The weight for each bit can be related to some function of its variance. In other words, if this descriptor has approximately the same value for all N objects to be clustered, it may not be a valuable descriminator. Conversely, if it varies considerably amongst the objects, it may be an important descriminator and should be given a relatively large weight. One definition weight is given by the expression

wk = A vk
where vk is the variance of this descriptor amongst all objects and A is a constant.

Reference:
[1] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.