ClassCK
Classifier Construction Kit
Brian T. Luke (lukeb@ncifcrf.gov)

Goal:
Omics investigations (proteomics, metabolomics, metabonomics, ...) generate a large amount of data on a relatively small number of samples. There is growing interest in using these data to distinguish one Class of samples from another (e.g., healthy versus diseased, or prostate cancer versus kidney cancer). The Classifier Construction Kit (ClassCK) is a collection of routines that allows users to construct classification models using their own or publicly available datasets.

A classifier, as constructed here, uses a small number of features, a distance metric, and a procedure that predicts the Class of an unknown sample by comparing it to a group of known samples with similar feature values. Each classifier is given a score based on how well it determines the Class of a set of training samples, on how compact the resulting clusters are (assuming a clustering classification method is used), or on a combination of the two. Given a distance metric and a classification method, ClassCK searches for the best set of features.
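
A feature search of this kind can be pictured as a mutate-and-select loop over small feature subsets. The Python sketch below is only an illustration under stated assumptions, not ClassCK's algorithm: the scoring function (leave-one-out nearest-neighbor accuracy with a Manhattan distance), the mutation rule (swap one feature), and every name and parameter (`loo_accuracy`, `search_features`, `pop_size`, `generations`) are hypothetical.

```python
import random


def loo_accuracy(samples, labels, features):
    """Score a feature subset: leave-one-out 1-nearest-neighbor accuracy.

    Assumption: Manhattan distance restricted to the chosen features.
    """
    correct = 0
    for i, x in enumerate(samples):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(samples):
            if j == i:
                continue
            d = sum(abs(x[f] - y[f]) for f in features)
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] == labels[i]:
            correct += 1
    return correct / len(samples)


def search_features(samples, labels, n_select, pop_size=10, generations=20, seed=0):
    """Evolve feature subsets of size n_select; return (best_subset, score)."""
    rng = random.Random(seed)
    n_features = len(samples[0])

    def rank_key(fs):
        # Higher score first; ties broken by the subset itself for determinism.
        return (-loo_accuracy(samples, labels, fs), fs)

    population = [
        tuple(sorted(rng.sample(range(n_features), n_select)))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        offspring = []
        for parent in population:
            child = list(parent)
            unused = [f for f in range(n_features) if f not in parent]
            if unused:
                # Mutation: replace one selected feature with an unused one.
                child[rng.randrange(n_select)] = rng.choice(unused)
            offspring.append(tuple(sorted(child)))
        # Selection: keep the best unique subsets among parents and offspring.
        combined = set(population) | set(offspring)
        population = sorted(combined, key=rank_key)[:pop_size]
    best = population[0]
    return best, loo_accuracy(samples, labels, best)
```

Calling `search_features(samples, labels, n_select=3)` on a list of equal-length numeric rows returns the best subset found (as a tuple of column indices) and its score. Any other scoring function, such as a cluster-compactness measure, could be substituted for `loo_accuracy`.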

ClassCK uses a modified Evolutionary Programming method to search for the best set of features. Nine different distance metrics are available.

  1. A Manhattan distance
  2. A Euclidean distance
  3. A Chebyshev distance
  4. A (1-r) distance [r is the Pearson's correlation coefficient]
  5. A (1-cos(θ)) distance [θ is the angle between two sample vectors]
  6. A θ-distance [θ is the angle between two sample vectors]
  7. A Canberra distance
  8. A Squared Chord distance
  9. A Squared Chi-squared distance
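
For reference, the nine metrics can be written compactly in Python using their common textbook definitions. This is a sketch under assumptions: ClassCK's own implementations may differ in normalization or in how zero-valued features are handled, and the function names are illustrative. The squared chord distance assumes non-negative feature values.

```python
import math


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """1 - r, where r is Pearson's correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """1 - cos(theta), theta being the angle between the two vectors."""
    return 1.0 - cosine_similarity(x, y)


def angular_distance(x, y):
    """theta itself, in radians; the cosine is clamped against rounding error."""
    return math.acos(max(-1.0, min(1.0, cosine_similarity(x, y))))


def canberra(x, y):
    # Terms where both values are zero are skipped (assumed convention).
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a != 0 or b != 0)


def squared_chord(x, y):
    # Requires non-negative values.
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))


def squared_chi_squared(x, y):
    # Terms with a zero denominator are skipped (assumed convention).
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)
```

Each function takes two equal-length sequences of feature values, for example `manhattan([1.0, 4.0, 9.0], [4.0, 4.0, 1.0])`.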

In addition, there are six classification methods available.

  1. Distance-Dependent K-Nearest Neighbors (DD-KNN)
  2. K-Means Clustering
  3. Single Linkage Clustering
  4. Average Linkage Clustering
  5. Complete Linkage Clustering
  6. Distance-Dependent Jarvis-Patrick Clustering

For a given distance metric and classification method the Evolutionary Programming driver searches for the best set of features. At the conclusion of the feature search the top classifiers can be examined by one or more of the following methods.

  • A Jackknife Analysis (if there is no testing dataset)
  • A Bootstrap Analysis (if there is no testing dataset)
  • Calculation of Average Silhouette Width (on clusters formed with the training dataset(s))
  • Calculation of the Kelley Penalty (on clusters formed with the training dataset(s))
  • A Receiver Operating Characteristic (ROC) Analysis
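
As an illustration of one of these checks, the silhouette width of a sample compares its mean distance to the other members of its own cluster (a) with its mean distance to the members of the nearest other cluster (b), giving s = (b - a) / max(a, b). The Python sketch below follows that standard definition; the function names are hypothetical, Euclidean distance is assumed as the default, at least two clusters are required, and the way ClassCK itself computes the value is not shown here.

```python
import math


def silhouette_widths(points, assignments, dist=math.dist):
    """Per-sample silhouette widths; assignments[i] is the cluster of points[i]."""
    clusters = {}
    for idx, c in enumerate(assignments):
        clusters.setdefault(c, []).append(idx)

    widths = []
    for i, c in enumerate(assignments):
        own = [j for j in clusters[c] if j != i]
        if not own:
            # By convention, a sample alone in its cluster gets a width of 0.
            widths.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != c
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette_width(points, assignments, dist=math.dist):
    widths = silhouette_widths(points, assignments, dist)
    return sum(widths) / len(widths)
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping clusters or misassigned samples.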

In addition, ClassCK is able to produce PostScript(tm) files that contain the following plots:

  • A Parallel Coordinate Plot of cluster centroids
  • A histogram plot of DD-KNN prediction probabilities
  • A Silhouette Width plot
  • A ROC plot

To learn more about ClassCK, including how to download the program, consult the menu on the left.

Disclaimer
ClassCK is freely available on an "as-is" basis. It comes with no warranties or guarantees, except that the author guarantees that certain combinations of distance metrics and classification models will not work. This work was funded in whole or in part with federal funds from the U.S. National Cancer Institute, National Institutes of Health, under contract no. NO1-CO-12400. The contents of this program and associated documentation do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does any mention of trade names, commercial products or organizations imply endorsement by the U.S. Government.