Current Activities
Fuzzy Breast Cancer Diagnosis
Brian T. Luke

The Wisconsin Breast Cancer Data Set used digitized images of stained nuclei from 569 breast tumors. Each image was examined to produce a set of 30 descriptors (metrics). Ten features (radius, perimeter, etc.) were measured for each isolated nucleus, and the average, standard deviation, and maximum value of each were determined, giving 10 x 3 = 30 descriptors per tumor. In addition, each tumor was separately determined to be malignant or benign.

The creators of this data set used a machine learning procedure to make predictions. In particular, all possible sets of three descriptors were examined, each of which produced a 3-dimensional descriptor space. A plane was passed through the data in each space and optimized to maximize the separation of malignant and benign tumors.
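To give a sense of the size of that search, the following is a minimal sketch that enumerates every set of three descriptors and counts the misclassifications for one candidate plane. The function name plane_errors, the +1/-1 label encoding, and the plane parameters w and b are assumptions made for illustration; the optimization the creators of the data set actually used is not described here.

```python
from itertools import combinations
from math import comb

N_DESCRIPTORS = 30

# Every unordered set of three descriptor indices.
triples = list(combinations(range(N_DESCRIPTORS), 3))
n_triples = len(triples)
assert n_triples == comb(N_DESCRIPTORS, 3)


def plane_errors(points, labels, w, b):
    """Count tumors on the wrong side of the plane w.x + b = 0.

    points: 3-tuples of descriptor values for one triple.
    labels: +1 for malignant, -1 for benign (an assumed encoding).
    """
    errors = 0
    for x, y in zip(points, labels):
        side = sum(wi * xi for wi, xi in zip(w, x)) + b
        if side * y <= 0:  # on the plane or on the wrong side
            errors += 1
    return errors
```

With 30 descriptors there are 4,060 such triples, and a separate plane has to be optimized for each one, which is consistent with the long run time noted for this approach.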

The best set of descriptors produced an incorrect diagnosis 3.5% of the time. These are very good results, but the approach suffers from two drawbacks. The first is that the analysis takes a very long time, and the second is that, for cancer, being wrong 3.5% of the time may be too much.

I used fuzzy clustering to examine this set. I ran 10,000 cross-validation studies, each with a 6-member test set. For each 563-member training set, the 30 descriptors were ordered from best to worst in their ability to distinguish malignant from benign tumors. A threshold value was used to determine the number of descriptors to use (15 for this data set). A linear combination of these descriptors was then created so that the differentiation between malignant and benign was maximized.
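The steps in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: the separation score (difference of class means divided by the summed class spreads), the score-weighted sum used as the linear combination, and all function names are hypothetical stand-ins, since the actual ranking criterion, threshold, and weights are not given here. Labels are assumed to be 1 for malignant and 0 for benign.

```python
import random
from statistics import mean, pstdev


def split_leave_six_out(n_samples, rng, n_test=6):
    """Randomly hold out n_test sample indices; the rest form the training set."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[n_test:], indices[:n_test]


def separation_score(values, labels):
    """Assumed signed score: difference of class means over the summed spreads."""
    mal = [v for v, y in zip(values, labels) if y == 1]
    ben = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(mal) + pstdev(ben)
    return (mean(mal) - mean(ben)) / spread if spread > 0 else 0.0


def rank_descriptors(rows, labels):
    """Return (descriptor index, score) pairs ordered from best to worst."""
    n_desc = len(rows[0])
    scores = [separation_score([r[j] for r in rows], labels) for j in range(n_desc)]
    return sorted(enumerate(scores), key=lambda p: abs(p[1]), reverse=True)


def combined_metric(row, ranked, n_keep):
    """Collapse the n_keep best descriptors into one number (score-weighted sum)."""
    return sum(score * row[j] for j, score in ranked[:n_keep])
```

A call such as split_leave_six_out(569, random.Random(0)) yields one 563-member training set and one 6-member test set; repeating it with different seeds gives the repeated studies. The score-weighted sum is a simplification and is not guaranteed to maximize the separation the way an optimized linear combination would.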

This single metric was then used in a fuzzy clustering. When a membership threshold of 80% was used, a "not sure" response was returned 4% of the time, but the procedure was wrong less than 1% of the time. A "not sure" verdict simply means that further tests should be done before proceeding.
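The thresholding described above can be illustrated with a small one-dimensional fuzzy c-means sketch. The fuzzifier m = 2, the two-cluster setup, the starting centers, and the function names are assumptions chosen for illustration; the clustering variant actually used is not specified here.

```python
def memberships(x, centers, m=2.0):
    """Fuzzy c-means membership of the scalar x in each cluster center."""
    dists = [abs(x - c) for c in centers]
    for k, d in enumerate(dists):
        if d == 0.0:  # x sits exactly on a center: full membership there
            return [1.0 if j == k else 0.0 for j in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means_1d(values, centers, m=2.0, n_iter=100):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = list(centers)
    for _ in range(n_iter):
        u = [memberships(x, centers, m) for x in values]
        centers = [
            sum(ui[k] ** m * x for ui, x in zip(u, values))
            / sum(ui[k] ** m for ui in u)
            for k in range(len(centers))
        ]
    return centers


def diagnose(x, centers, names=("benign", "malignant"), threshold=0.8):
    """Return a class name only if its membership reaches the threshold."""
    u = memberships(x, centers)
    best = max(range(len(u)), key=u.__getitem__)
    return names[best] if u[best] >= threshold else "not sure"
```

A metric value close to one cluster center receives a membership near 1 and is assigned to that class, while a value midway between the two centers has a membership of 0.5 in each and is reported as "not sure".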

All 10,000 samples were run in less than one CPU-minute on a Power2, 591 node of an IBM SP2. Therefore, this procedure is able to quickly produce fuzzy sets that significantly reduce the number of incorrect diagnoses, at the expense of returning a "not sure" diagnosis 4% of the time.