MINIMAL PANELS OF RNA MARKERS FOR CELL TYPES USING SINGLE-CELL DATA
JI-DISSERTATION-2022.pdf (8.648Mb) (embargoed until: 2026-05-01)
Date
2022-03-18
Author
Ji, Lanlan
Abstract
Single-cell RNA sequencing technologies provide measurements of the number of RNA
molecules in many thousands of individual cells, a rich source of information for
determining attributes of cell populations, such as cell types and the variation in gene
expression from cell to cell, which are not available from bulk RNA sequencing data
[1–5]. A core challenge in the analysis of scRNA-seq data is to find “marker genes” for
a given class of cells, e.g., a cell type. Another challenge is to describe, let alone quantify,
how the individual marker genes cooperate to determine cell labels. Generally, most
existing methods of scRNA-seq analysis are at the univariate (single gene) level even
though the relevant biology is often decidedly multivariate.
In this thesis we introduce a method that formulates marker gene selection as
a variation of the well-known “minimal set-covering problem” in combinatorial
optimization. Here, the “covering” elements are genes and the objects to be covered are a
sub-population of cells with a particular label k. In order to draw this link between
marker panels and set coverings, we binarize the raw mRNA counts into “expressed”
(positive count) or “not expressed” (zero count). The resulting paradigm, based on
covering a target class, differs fundamentally from most standard approaches, in which
optimal panels are determined by optimizing their weights with a fixed panel size. In
addition to enabling the link to set covering, binarization facilitates the biological interpretation of marker genes and the manner in which they characterize and discriminate
among types of cells. Using the covering paradigm, we can predict cell types or transfer marker panels to identify shared cellular processes across data sets in related biological
contexts using extremely transparent discriminants, such as the number of expressed
panel genes. We illustrate this new methodology in the context of neocortical
neurogenesis during mid-gestation, when the vast majority of neurons in the brain are produced.
To further investigate some basic properties of covering marker panels, we discuss
the stability of covering marker sets, as well as the gene interactions within a marker
set. Some generalizations and extensions of the covering algorithm are also introduced.
In addition, we present a semi-supervised learning version of marker panel construction when
cell labeling is incomplete or some marker genes are known. Finally, we introduce a
marker panel based on pairs of genes which characterizes the transitions between cell
states.
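
To make the covering idea concrete, the following is a minimal sketch, not the method developed in the thesis. It assumes a small toy count matrix, binarizes counts into expressed / not expressed as described above, and then applies the textbook greedy heuristic for set cover to the cells of one label. The function names (greedy_cover_panel, n_expressed) and the toy data are hypothetical, and the sketch ignores how well the chosen genes discriminate the target class from other cells, which the thesis's variation of the problem may handle differently.

```python
import numpy as np


def greedy_cover_panel(counts, labels, target):
    """Greedy approximation to a minimal gene panel covering the target cells.

    counts: (n_cells, n_genes) array of raw mRNA counts.
    labels: length n_cells sequence of cell labels.
    target: the label k whose cells should be covered.
    Returns a list of gene (column) indices.
    """
    expressed = np.asarray(counts) > 0            # binarize: positive count = expressed
    in_target = np.asarray(labels) == target
    target_expr = expressed[in_target]            # rows are target cells only
    uncovered = np.ones(target_expr.shape[0], dtype=bool)
    panel = []
    while uncovered.any():
        # For each gene, count the still-uncovered target cells that express it.
        gains = target_expr[uncovered].sum(axis=0)
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                                 # remaining cells express no gene at all
        panel.append(best)
        uncovered &= ~target_expr[:, best]        # cells expressing `best` are now covered
    return panel


def n_expressed(counts, panel):
    """Number of panel genes expressed in each cell (a transparent discriminant)."""
    return (np.asarray(counts)[:, panel] > 0).sum(axis=1)


# Toy data (assumed for illustration): 5 cells x 4 genes, labels A and B.
counts = np.array([
    [3, 0, 0, 1],   # cell 0, type A
    [0, 2, 0, 0],   # cell 1, type A
    [5, 0, 0, 0],   # cell 2, type A
    [0, 0, 4, 0],   # cell 3, type B
    [0, 0, 1, 2],   # cell 4, type B
])
labels = ["A", "A", "A", "B", "B"]

panel = greedy_cover_panel(counts, labels, "A")
scores = n_expressed(counts, panel)
print(panel, scores.tolist())
```

In this toy example every type A cell expresses at least one panel gene, so thresholding the number of expressed panel genes separates the two labels. On real data a single expressed gene is a fragile criterion, which is one reason to require that each cell be covered more than once; that extension is not shown here.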