Microarray-based Multiclass Classification using Relative Expression Analysis
Microarray gene expression profiling has led to a proliferation of statistical learning methods proposed for a variety of problems related to biological and clinical discoveries. One major problem is to identify gene expression-based biological markers for class discovery and prediction of complex diseases such as cancer. For example, statistical learning models are used to find expression patterns of genes that are associated with phenotypes (e.g., classes of disease). Early hopes that well-developed methods such as support vector machines would completely revolutionize the field have been moderated by the difficulties of analyzing microarray data. Hence, new and effective approaches need to be developed to address some common limitations encountered by current methods. This thesis focuses on improving statistical learning on microarray data through rank-based methodologies. Relative expression analysis, introduced in Chapter 1, is the central concept for the methodological development: the relative expression ordering (i.e., the relative ranks of expression levels) of genes is investigated instead of the actual expression values of individual genes. Supervised learning problems are studied where classification models are built for differentiating disease states. An unsupervised learning task is also examined in which subclasses are discovered by cluster analysis at the molecular level. Both types of problems under study consist of multiple classes. In Chapter 2, a novel rank-based classifier named Top Scoring Set (TSS) is developed for microarray classification of multiple disease states. It generalizes the Top Scoring Pair (TSP) method for binary classification problems to the multiclass case. Its main advantage lies in the simplicity and power of its decision rule, which provides transparent boundaries and allows for potential biological interpretations.
Since TSS requires a dimension reduction in the training process, a greedy search algorithm is proposed to perform a fast search over the feature space. In addition, ensemble classification based on TSS is also investigated. In Chapter 3, an efficient and biologically meaningful dimension reduction for the TSS classifier is introduced using publicly available pathway databases. Pre-defined functional gene groups are analyzed for microarray classification. The pathway-based TSS classifier is validated on an extremely large cohort of leukemia patients. Also, the unsupervised learning ability of relative expression analysis is studied and a rank-based clustering approach is introduced to identify molecularly distinct subtypes of breast cancer patients. Based on the clustering results, the TSP classifier is used for predicting the subtypes of individual breast cancer tumors. These rank-based methods provide an independent validation for the current identification of breast cancer subtypes. Overall, this thesis provides developments and validations of statistical learning methods based on relative expression analysis.
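
To make the idea of relative expression ordering concrete, the following is a minimal sketch of a binary Top Scoring Pair classifier as it is commonly described in the literature. It is not the thesis's TSS method or its code. The function names `fit_tsp` and `predict_tsp`, the toy expression matrix, and the tie-handling (first maximal pair wins) are assumptions made for illustration only.

```python
import numpy as np

def fit_tsp(X, y):
    """Pick the gene pair (i, j) whose ordering X_i < X_j best separates two classes.

    X: array of shape (n_samples, n_genes); y: binary labels (0 or 1).
    Returns (i, j, label_if_less, score). Illustrative sketch only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # less[s, i, j] is True when gene i is expressed below gene j in sample s.
    # This uses O(n_samples * n_genes^2) memory, so it suits small gene sets only.
    less = X[:, :, None] < X[:, None, :]
    p0 = less[y == 0].mean(axis=0)  # estimate of P(X_i < X_j | class 0)
    p1 = less[y == 1].mean(axis=0)  # estimate of P(X_i < X_j | class 1)
    score = np.abs(p0 - p1)
    # Assumption: ties are broken by taking the first maximal pair.
    i, j = np.unravel_index(np.argmax(score), score.shape)
    label_if_less = 0 if p0[i, j] > p1[i, j] else 1
    return int(i), int(j), label_if_less, float(score[i, j])

def predict_tsp(model, X):
    """Classify each sample by comparing the two selected genes."""
    i, j, label_if_less, _ = model
    X = np.asarray(X, dtype=float)
    return np.where(X[:, i] < X[:, j], label_if_less, 1 - label_if_less)

# Toy data (made up): 4 samples, 3 genes, two classes.
X_train = np.array([[1.0, 5.0, 3.0],
                    [2.0, 6.0, 9.0],
                    [7.0, 3.0, 4.0],
                    [8.0, 2.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
model = fit_tsp(X_train, y_train)

X_new = np.array([[0.5, 2.0, 1.0],
                  [9.0, 1.0, 4.0]])
pred = predict_tsp(model, X_new)
# The rule depends only on ranks, so a monotone transform leaves predictions unchanged.
pred_log = predict_tsp(model, np.log2(X_new + 1.0))
```

Because the decision rule is a single comparison between two genes, it is unaffected by any monotone rescaling of a sample's expression values, which is the property that motivates rank-based methods for microarray data.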