BCC2020 has ended
Registration closed July 15.

BCC2020 is online, global, and affordable. The meeting and training are now done, and the CoFest is under way.

The 2020 Bioinformatics Community Conference brings together the Bioinformatics Open Source Conference (BOSC) and the Galaxy Community Conference into a single event featuring training, a meeting, and a CollaborationFest. Events run from July 17 through July 25 and are held in both the eastern and western hemispheres.

Tuesday, July 21 • 12:55 - 13:00
ViPRA-Haplo: de novo reconstruction of viral populations using paired end sequencing data 🍐



The presenter(s) will be available for live Q&A in this session (BCC West).
See Poster P7-01 for more details.

Weiling Li 1, Raunaq Malhotra 2, Steven Wu 3, Manjari Jha 4, Allen Rodrigo 5, Mary Poss 6, and Raj Acharya 7

1 Indiana University, Bloomington, IN. Email: wli6@iu.edu
2 GNS Healthcare, Cambridge, MA. Email: rmalhotra@gnshealthcare.com
3 BioConsortia, Davis, CA. Email: stevenhwu@gmail.com
4 Microsoft, Redmond, WA. Email: manjari.mu@gmail.com
5 The University of Auckland, Auckland, New Zealand. Email: a.rodrigo@auckland.ac.nz
6 The Pennsylvania State University, University Park, PA. Email: maryposs@gmail.com
7 Indiana University, Bloomington, IN. Email: racharya@iu.edu

Project Website: https://github.com/raunaq-m/MLEHaplo
Source Code: https://github.com/raunaq-m/MLEHaplo
License: BSD 2-Clause License

Main Text of Abstract

Viruses replicating within a host exist as a collection of closely related genetic variants known as viral haplotypes. The diversity in a viral population, or quasispecies, is due to mutations (insertions, deletions, or substitutions) or recombination events that occur during virus replication. These haplotypes differ in relative frequency and together play an important role in the fitness and evolution of the viral population. This variation in viral sequences poses a challenge to vaccine design and drug development. We present ViPRA-Haplo, a de novo assembly algorithm for reconstructing viral haplotypes in a virus population from paired-end next-generation sequencing (NGS) data. The proposed Viral Path Reconstruction Algorithm (ViPRA) uses the pairing information of reads to generate a subset of paths from a de Bruijn graph of the reads. These paths represent contigs of the virus, and they over-estimate the set of possible contigs. We therefore propose two methods to obtain an optimal set of contigs representing the viral haplotypes. The first method uses VSEARCH to cluster the paths reconstructed by ViPRA; the centroid of each cluster represents a contig. The second method, MLEHaplo, generates a maximum likelihood estimate of the viral population from the ViPRA paths.

We evaluate and compare ViPRA-Haplo on a simulated data set, on a real HIV MiSeq data set (SRR961514) with sequencing errors, and on a real data set from the emerging SARS-CoV-2 virus (SRR10903401). On the simulated data, ViPRA-Haplo reconstructs full-length viral haplotypes with 99.7% sequence identity to the true viral haplotypes at 250x sequencing coverage. For the real NGS data, the error-correction software Karect is used to improve de novo assembly. The real HIV data set contains 714,994 read pairs (2x250 bp) that cover the five strains at 20,000x. Our method reconstructs contigs that cover over 90% of each strain's reference genome, which is higher than the coverage achieved by the benchmark tool PEHaplo. On the SARS-CoV-2 data, after filtering for SARS-CoV-2 contigs with the metagenomic classifier Centrifuge, the contigs reconstructed by our method cover over 99% of the reference genome. The comparisons on both simulated and real data show that ViPRA-Haplo outperforms existing tools, both in coverage of the reference genome(s) and in retaining the variation naturally present in the viral population.
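
To make the graph-based step in the abstract more concrete, here is a minimal sketch of building a de Bruijn graph from reads and enumerating the paths through it. It is not the ViPRA implementation: the function names, the toy reads, and the choice of k are all hypothetical, and the sketch enumerates every simple path rather than using read-pairing information to select a subset, as ViPRA is described as doing.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, source, sink):
    """Yield the sequence spelled by each simple path from source to sink."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            # First node contributes k-1 bases; each later node adds one base.
            yield path[0] + "".join(n[-1] for n in path[1:])
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # skip nodes already on the path to avoid cycles
                stack.append((nxt, path + [nxt]))


# Hypothetical toy data: two "haplotypes" that differ by one substitution.
reads = ["ACGTTGCA", "ACGATGCA"]
graph = build_de_bruijn(reads, k=4)
paths = sorted(enumerate_paths(graph, "ACG", "GCA"))
print(paths)
```

In this toy case the substitution creates a bubble in the graph, and each branch of the bubble spells one candidate haplotype. With real reads, the number of paths grows quickly, which is why a path-selection step and a later clustering or likelihood-based reduction are needed.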


Weiling Li

postdoc, Indiana University - Bloomington

Tuesday July 21, 2020 12:55 - 13:00 EDT