DISCOVERING TECHNICAL AND BIOLOGICAL DRIVERS OF VARIATION IN BULK RNA-SEQUENCING
Johns Hopkins University
Next-generation sequencing has ushered in a new era of high-throughput experiments collectively referred to as -Omics. These technologies allow us to understand the genome’s structure and content, the transcriptomes of tissues and cells, and how various proteins interact with both DNA and RNA. Despite these strengths, the technologies come with difficulties and pitfalls. Implementing them properly requires understanding their batch effects, how they validate each other, and how gene groups behave differently between tools. I used the largest public bulk RNA-sequencing dataset of normal tissues, in tandem with other datasets, to describe the biological and technical variation of one of these technologies, bulk RNA-sequencing. Using the Genotype-Tissue Expression (GTEx) dataset, I describe a previously unreported form of sequencing contamination. I determine that this contamination is temporally associated with sequencing date and most likely arises during library preparation. I validate that the contamination is not endogenous using single nucleotide polymorphism differences between RNA-sequencing reads and genomic data. Using other bulk and single-cell sequencing datasets, I extend these findings to other databases, revealing how such contamination can affect the results of experiments. The Human Protein Atlas is a resource that has immunohistochemistry data on 44 non-disease tissues and 15,320 proteins, but it can only be queried one protein at a time. I developed HPAStainR, a Bioconductor R package and Shiny app that allows multiplexed querying of Human Protein Atlas immunohistochemistry data. I validate the tool against the Panglao Database of single-cell markers and recapitulate findings from previous bulk RNA-sequencing experiments. Lastly, I explore the transcriptomic variance of the matrisome, a collection of extracellular matrix proteins. I interrogate the differences between tissues’ matrisome profiles at both the transcriptomic and proteomic levels.
I show that age-related changes in matrisome expression correlate with histological changes, and I show differences between the sexes in adipose tissues. In non-normal tissue, I describe the “generic” matrisome cancer signal and use GTEx samples to explore changes in idiopathic fibrosis of the lung. Together, I discovered a new form of technical variation, created a tool that uses public data to determine which cell types drive variation, and focused on a subset of -Omics in which variation can occur.
variation, GTEx, RNA-sequencing, bulk sequencing, matrisome, Human Protein Atlas