High-throughput detection, annotation, and interpretation of genomic variants using next-generation sequencing data

Kim, Dewey

Files

KIM-DISSERTATION-2016.pdf (3.62 MB)

Embargo until

2020-05-01

Date

2015-12-04

Authors

Kim, Dewey

Publisher

Johns Hopkins University

Abstract

Recent advances in sequencing technology have made it feasible to use next-generation sequencing (NGS) to characterize the genomic landscape of an individual. Unlike microarrays, which are usually only capable of detecting variants that have been previously discovered in a population, NGS is capable of discerning both common and rare de novo variants. Sequencing studies that involve the analysis of rare variants in human disease typically follow three steps: variant calling, where variants in NGS data are identified; variant annotation, where biologically relevant features are attached to each variant; and variant interpretation, where statistical and machine learning methods are used to prioritize putative functional variants. In this thesis, I attempt to apply and improve these methods in the context of cancer and schizophrenia. Recent matched tumor/normal whole exome sequencing studies, coupled with current variant calling tools, have generated large sets of high-confidence genomic variants. A significant proportion of these variants are missense variants of unknown impact. To increase the speed and efficiency of annotating these variants, I helped create a database of 86 precomputed disease-relevant features for all possible missense variants in the human exome. This tool allows for near-instantaneous annotation of any variant dataset. A common limitation of variant effect prediction software is its limited ability to infer the actual functional impact of a putative mutation. Most bioinformatics tools will only return a value signifying the likelihood that the variant is functional. In an effort to aid in the interpretation of candidate mutations, I created a tool capable of mapping one-dimensional sequence positions and features onto three-dimensional protein positions. This database, coupled with an online web interface, enables interactive 3D visualization of a variant's position in the context of its protein.
I also used the NGS variant analysis pipeline to try to uncover novel insights into the genetics of schizophrenia using a matched case/control dataset of 500 postmortem human brain RNA-seq samples. Since RNA-seq is a relatively new NGS technique, I participated in a study to evaluate its technical reproducibility. As part of this study, I found that a new RNA-seq library preparation protocol, involving the depletion of ribosomal RNA using magnetic beads, allows for consistently high detection of intronic reads from pre-mRNAs and of long noncoding RNAs (lncRNAs). To evaluate the role of rare non-coding variants in schizophrenia, I developed a pipeline for calling rare variants in RNA-seq data that uses a series of alignment tools and yields an 80% decrease in the number of false positives compared with a standard approach. I also created a pipeline for detecting and analyzing short tandem repeats in RNA-seq data. Using this pipeline, I discovered statistically significant alterations in intronic short tandem repeats within genes involved in the innate immune response.

Keywords

Bioinformatics, Genomics, RNA-seq, Next-gen sequencing

URI

http://jhir.library.jhu.edu/handle/1774.2/60599

Collections

ETD -- Doctoral Dissertations
