Sample-based Measures of Dysregulation and Heterogeneity in Cancer Molecular Profiles

Johns Hopkins University
Computational models are essential to understand the molecular mechanisms underlying cell function and tissue organization in both health and disease. Clinically relevant molecular signatures derived from high-dimensional omics data represent unprecedented tools to personalize treatment. However, the absence of mechanistic underpinnings for the signatures generated by machine learning algorithms, and the absence of robust, quantitative measures of dysregulation, represent important barriers to successful clinical implementation. This thesis is focused on developing general procedures for representing and distinguishing among disease phenotypes, which embed biological mechanisms into the statistical learning process, and quantify levels of dysregulation and heterogeneity. We introduce a unified theory called “divergence” to convert an omics profile to a digital representation by comparing the profile of an individual to the range of landscapes in a baseline population. The reduction in complexity facilitates a more personalized and biologically interpretable analysis of variation. We introduce several new representations of multi-omics profiles which are highly simplified and yet sufficiently rich to account for observed heterogeneity. Starting from the network of gene-gene interactions existing in Reactome, we build a library of “Source-Target Pairs” (STPs); each consists of a “source” gene and a “target” gene whose expression is plausibly controlled by the source gene. An STP is “aberrant” if the source gene is DNA-aberrant and the target gene is RNA-aberrant. To further reduce complexity, we use integer programming to “cover” the disease samples with a set of aberrant STPs, that is, to find the smallest family of STPs such that every sample displays at least one aberrant STP within that family. Given such a covering, inter-sample heterogeneity is quantified by the entropy of the distribution of covering states over the population.
Finally, we develop a prediction model, “weighted voting”, which incorporates gene regulatory network information into the model parameters. The features are aberration states of individual genes and STPs, selected to display sharp differences in distribution among phenotypes of interest. We apply the entire framework to TCGA data from six distinct tumor types, demonstrating that our approach is well-suited to accommodate the expanding complexity of cancer genomes emerging from large consortia projects.
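The covering step and the entropy measure described in the abstract can be illustrated with a small sketch. It is not the thesis implementation: the abstract states that integer programming is used, whereas this sketch swaps in an exhaustive search over STP subsets, which gives the same minimum-size family only for toy inputs. The function names `minimal_cover` and `covering_entropy`, the sample labels, and the gene labels `G1`..`G6` are all hypothetical, and a "covering state" is assumed here to be the binary pattern of which covering STPs are aberrant in a sample.

```python
from collections import Counter
from itertools import combinations
from math import log2


def minimal_cover(aberrant):
    """Return a smallest family of STPs such that every sample has at least
    one aberrant STP in the family.

    `aberrant` maps each sample to its set of aberrant STPs. Exhaustive
    search is a stand-in for integer programming (assumption): exact, but
    only practical for tiny examples.
    """
    if any(not stps for stps in aberrant.values()):
        raise ValueError("a sample with no aberrant STP cannot be covered")
    candidates = sorted(set().union(*aberrant.values()))
    for size in range(1, len(candidates) + 1):
        for family in combinations(candidates, size):
            chosen = set(family)
            if all(chosen & stps for stps in aberrant.values()):
                return list(family)
    return []  # reached only when there are no samples


def covering_entropy(aberrant, family):
    """Shannon entropy (bits) of the distribution of covering states.

    A covering state is assumed to be the tuple of 0/1 indicators, one per
    STP in `family`, marking which covering STPs are aberrant in a sample.
    """
    states = Counter(
        tuple(int(stp in stps) for stp in family) for stps in aberrant.values()
    )
    total = sum(states.values())
    return -sum((n / total) * log2(n / total) for n in states.values())


# Hypothetical toy data: four samples, three STPs written as (source, target).
toy = {
    "s1": {("G5", "G6")},
    "s2": {("G5", "G6"), ("G3", "G4")},
    "s3": {("G3", "G4")},
    "s4": {("G3", "G4"), ("G1", "G2")},
}
cover = minimal_cover(toy)
entropy = covering_entropy(toy, cover)
```

On this toy input no single STP covers all four samples, so the smallest covering has two STPs, and the four samples fall into three distinct covering states. A population in which every sample shares one covering state would have entropy zero, which is the sense in which the measure tracks inter-sample heterogeneity.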
Machine Learning, Computational Medicine, Cancer Research