MATHEMATICAL ANALYSES OF GENOME COMPLEXITY AND POPULATION DIVERSITY USING NEXT- AND THIRD-GENERATION SEQUENCING TECHNOLOGIES

Embargo until
2022-05-01
Date
2020-12-08
Publisher
Johns Hopkins University
Abstract
The advent and continued improvement of DNA sequencing methods promise deeper insights into the genomes of living organisms. The deluge of data from second-generation and third-generation sequencing technologies requires large-scale bioinformatics tools. These tools must account not only for genome complexity within a single individual or species but also for population diversity across individuals and species. This dissertation presents three tools that leverage mathematical insights and modeling to profile and analyze genome complexity and population diversity. First, GenomeScope applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. GenomeScope accurately estimates genomic characteristics such as length, heterozygosity, and repetitiveness, even for complex organisms. Second, SVCollector analyzes a population-level VCF file from a low-resolution genotyping study and uses a greedy algorithm to compute a ranked list of samples that maximizes the total number of variants present in a subset of a given size. SVCollector runs quickly on thousands of samples and allows for a more cost-efficient way to identify and validate variants within large populations. Finally, HetSmoother uses a k-mer-based approach to identify sequencing errors and heterozygous regions within sequencing reads. Once the errors are removed, the heterozygous regions are consistently edited to a single haplotype, which improves the contiguity and reduces the duplication of assemblies for heterozygous genomes.
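The greedy sample ranking described for SVCollector can be illustrated with a minimal sketch. This is not SVCollector's implementation: the function name `greedy_rank_samples`, the dictionary input (sample name mapped to a set of variant identifiers), and the toy data are all assumptions made for illustration, whereas the actual tool reads a population-level VCF file. At each step the sketch picks the sample that contributes the most variants not yet covered by earlier picks.

```python
from typing import Dict, List, Set, Tuple


def greedy_rank_samples(
    sample_variants: Dict[str, Set[str]], n: int
) -> List[Tuple[str, int]]:
    """Rank up to n samples; each pick adds the most not-yet-covered variants.

    Hypothetical helper for illustration only, not part of SVCollector.
    """
    # Work on copies so the caller's sets are left untouched.
    remaining = {name: set(v) for name, v in sample_variants.items()}
    ranked: List[Tuple[str, int]] = []
    while remaining and len(ranked) < n:
        # Choose the sample with the most uncovered variants;
        # ties are broken by sample name so the result is deterministic.
        best = min(remaining, key=lambda s: (-len(remaining[s]), s))
        gained = remaining.pop(best)
        if not gained:
            break  # no remaining sample adds a new variant
        ranked.append((best, len(gained)))
        # Variants covered by this pick no longer count for later picks.
        for variants in remaining.values():
            variants -= gained
    return ranked


# Toy data (invented): four samples carrying five distinct variants.
toy = {
    "sampleA": {"sv1", "sv2", "sv3"},
    "sampleB": {"sv3", "sv4"},
    "sampleC": {"sv1", "sv2"},
    "sampleD": {"sv5"},
}
ranking = greedy_rank_samples(toy, 3)
print(ranking)
```

Each entry in the returned list pairs a sample with the number of new variants it adds, so the running sum gives the number of variants covered by the first few samples in the ranking. A greedy choice of this kind is a heuristic for the underlying coverage problem and does not, in general, guarantee the best possible subset.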
Keywords
genomics, computational biology, bioinformatics, DNA sequencing, population diversity, mathematical modeling