| Current Activities |
|
For
K-Means and
Hierarchical Clustering, as well as
Diversity Selection, it is necessary to calculate a "distance"
between either two objects (one of which may be a cluster seed point) or
an object and a group centroid. In this discussion, it is assumed that
each object is described by an array of real-valued metrics. A discussion
of distance metrics between binary objects is given in
Clustering Binary Objects.
Since we need to minimize the largest or average distance, or the variance, during clustering, it is important that each metric contribute equally to the total distance. In other words, if one metric spans the range [0.0,0.5] and another spans [0.0,100.0], the maximum deviation in the first would have little effect on the total distance, while even a modest separation in the second would have a much larger effect. To remove this dependency on the range spanned by each metric, it is important to first standardize the values. This means that each metric, when compared over the full set of objects will have a mean of 0.0 and a variance (or standard deviation) of 1.0. For each metric, the following steps should be taken.
If each object is described by a real-valued array of metrics of length K, the ith object Xi has the array {xki} for k=1,2,...,K. The general form for the distance, which is called the LN norm, between object i and object/centroid j is
For K=2, the region contained within a threshold distance for various values of N is shown below. When N=1, L1ij is known as the city-block or Manhattan distance; while for N=2, L2ij is the Euclidean distance. One can see from the pictures below, as N increases the enclosed region takes on more the shape of a square. For an infinitely large value of N, this would be a perfect square. This Linf norm is called the Chebychev distance.
Other distance measures that may be used are given in the following table.
Though it is required for the Squared chord distance, it is highly recommended that the metrics be shifted to non-negative (or even positive) values before calculating any of these distances. Otherwise, it is possible to have infinite Canberra and Squared Chi-squared distances. Another method of calculating a distance is to first measure the similarity between two objects. This similarity, S, generally takes on values between 0.0 (completely dissimilar) and 1.0 (identical). The distance can then be given by A(1-S)M, where M is a non-negative exponent and A is a constant. If the objects to be clustered are identified by L descriptors, the object X contains the descriptors {xi} and object Y contains {yi} (for i=1,2,...,L). The cosine similarity function [1] between these objects, CSxy, treats the set of descriptors as components of an M-dimensional vector, and the similarity is the cosine of the angle between these vectors (their dot product divided by their magnitudes). This similarity, which is also known as the Ochini coefficient, is given by the expression ![]() The cosine similarity function as well as other similarity and distance metrics can be used to cluster binary objects. One final point is that it is possible to weight each descriptor. If x'ik is the kth descriptor (k=1,2,...L) for object i, it can be weighted by wk. The terms in the above expressions become The weight for each bit can be related to some function of its variance. In other words, if this descriptor has approximately the same value for all N objects to be clustered, it may not be a valuable descriminator. Conversely, if it varies considerably amongst the objects, it may be an important descriminator and should be given a relatively large weight. One definition weight is given by the expression
Reference: |