| Current Activities |
|
Binary objects are simply objects that are described by a finite-length
bit string. They are often used in chemical databases because they
can be very useful in searching for a particular compound.
If the search string
has a particular bit turned ON and a database compound does not, then
that structure can be passed over. Two possible
representations are Molecular Fingerprints
and Structural Keys.
The bit strings that characterize two objects can also be used to calculate a "distance." This effective distance can then be used with a clustering algorithm to place the objects into groups. One of the most popular methods for calculating the distance between two objects is to first determine their similarity. If a general similarity metric is denoted S, it usually has a value between 0.0 (the bit strings have no ON bits in common) and 1.0 (the two bit strings are identical). A distance can then be given by A(1-S)M, where M is a non-negative exponent and A is a constant. Conversely, each bit string can be thought of as components of an L-dimensional vector and the similarity as just the cosine of the angle between them (which is actually the case for the Ochini similarity). Therefore, a distance can be (A)arcos(S), or just a constant times the value of this angle. If the bit string has a length of L, it is possible to go down this string and count the number of times a bit is ON in both strings, ON in one and OFF in the other, or OFF in both strings. The four sums are presented in the table below.
The first subscript refers to the value of the bit for object i and the second for object j, summed over all L bits. Therefore, B10 is the number of times a bit is ON in i and OFF in j. Some other symbols that will be used in this section are
With these definintions, commonly used similarity metrics are presented in the following table.
The last two metrics are different from the rest. Jaccard's Similarity is different in that it is not symmetrical; the similarity of i to j is different than the similarity of j to i unless the two bit strings have exactly the same number of ON bits (Bi=Bj). This metric is useful if you want to measure a "substructure similarity." If i is a substructure of j, then every bit that is ON in i will also be ON in j though j will also have other bits ON. The other metrics will yield a similarity less than 1.0, while JSi-j produces a similarity of 1.0. Similarly, if j has less bits ON than i, JSj-i can be used to measure a substructure distance. The Tversky Index is unique in that it has two adjustable parameters, a and b. If a=1 and b=0, TI10=JSi-j. If a+b=1, this expression can be rewritten as A dissimilarity metric, D, is a measure of the difference between two bit strings. This metric has a value of 0.0 for identical bit strings and increases as they become less similar. It can therefore be scaled by any constant to yield a distance. A dissimilarity metric can easily be formed by taking D=(1-S), where S is a similarity value from the table above. Certain dissimilarities presented in the literature are given in the table below. They are proportional to either the sum or the product if the disagreement between the bit strings, B10 and B01 (note that B10+B01=L-BI is the Hamming Distance between these bit strings).
In chemical databases, this metrics can yield a significant dissimilarity (distance) if a substructure of a molecule is compared to the full molecule. To remove this problem, Daylight Systems developed a Squared Euclidean Substructure Distance, DSES. If compound i is smaller than compound j (less bits turned ON), DSESi-j may yield a more accurate distance. Similariy, if object j has less bits turned ON than i (Bj < Bi), DSESj-i can be used. With any of these distance metrics, a clustering algorithm can be used to place these binary objects into groups. Since it is not possible to calculate an average position between two binary objects (unless they are identical), one cannot determine a centroid for a group of binary objects. To circumvent this, a medoid [2] can be used instead. A medoid is simply a bit string that minimizes the sum of the distance to all objects in the group, and can be used with Wards Method (described in Agglomerative Linkages) and K-means clustering. In 1987, Peter Willett [3] studied the clustering of chemical databases and found that Jarvis-Patrick clustering [4] produced the best results. I used a similar procedure to try to cluster nucleotide sequences. Because of the large variation in the number of 1's for different bit strings, I found that the following distance metric produced better results. One final point is that it is possible to weight each bit in the string. If bik is the bit in position k (k=1,2,...L) for object i, it can be weighted by wk. The terms in the above expressions become ![]() The weight for each bit can be related to the "inverse frequency" of being turned on. In other words, if this bit is on for all N objects to be clustered, it cannot be used as a descriminator. Conversely, if it is on in only a few of the objects, it may be an important descriminator and should be given a relatively large weight. One such inverse frequency weight is given by the expression Note that the discussion above is limited to a comparison of bit strings that all have the same length. Clustering Character Objects describes metrics for comparing character strings, and character or binary strings of different length.
References: |