Computational Methods for Structural Variation Analysis in Populations

Johns Hopkins University
Recent advances in long-read sequencing have given us an unprecedented view of structural variants (SVs). However, much of their role in disease and evolution remains unknown because of several technical and biological challenges, including the high error rate of most long-read sequencing data, the additional complexity of aligning around large variants, and biological differences in how the same SV can manifest in different individuals. In this thesis we introduce novel methods for structural variant analysis and demonstrate how they overcome many of these obstacles. First, we apply recent advances in data structures to the substring search problem and show how learned index structures can accelerate the alignment of genomic reads. Next, we present an optimized SV calling pipeline that integrates improvements to existing software alongside two novel SV-processing methods: Iris, which improves the accuracy of SV breakpoints and sequences in individual samples, and Jasmine, which compares and integrates SV calls from multiple samples. Finally, we show how the introduction of CHM13, the first gap-free telomere-to-telomere human reference genome, enables, for the first time, variant calling in over 100 Mbp of newly resolved sequence and mitigates long-standing issues in variant calling that were attributed to gaps, errors, and minor alleles in the prior GRCh38 reference. We demonstrate the broad applicability of our advancements in SV inference by uncovering novel associations with gene expression in 444 human individuals from the 1000 Genomes Project, by detecting SVs in the tomato genome that affect fruit size and yield, and by comparing SVs between tumor and normal cells in organoids derived from the SKBR3 breast cancer cell line.
genomics, computational biology, genetic variation, structural variation, genome sequencing