The presenter(s) will be available for live Q&A in this session (BCC West).
Rayan Chikhi 1, Kyl Wellman, Steven J. Hallam 2, Anton Korobeynikov 3, Dan Lohr, Robert
C. Edgar, Artem Babaian, Dmitry Meloshko 3, Tomer Altman, Ryan J. McLaughlin 2, Jeff
Taylor, Victor Lin, and Gherman Novakovsky 2.
# Author list randomized
1 Institut Pasteur & CNRS
2 University of British Columbia
3 Center for Algorithmic Biotechnologies, Saint Petersburg State University.
Contact: artem@rRNA.ca
Project Website: http://serratus.io
Source Code: https://github.com/ababaian/serratus
License: GPLv3
Abstract
Despite intense efforts to sequence and analyze SARS-CoV-2 isolates, understanding
of the virus's provenance is limited by incomplete genomic characterization of the
Coronaviridae (CoV) family.
Serratus is an open science project for the discovery of new virus sequences by
aligning all RNA-seq, metagenomic, metatranscriptomic and environmental NGS data
in the NCBI Sequence Read Archive (SRA).
Here we report a preliminary survey of 1.14 million sequencing libraries (26.78
petabases), in which we uncovered several previously unreported CoV species and
identified thousands of CoV+ libraries.
To perform this ultra-high-throughput CoV search, we used AWS cloud HPC with
a 22,500 vCPU cluster. A hyper-parallelized architecture let us bypass
conventional networking and disk I/O bottlenecks and achieve a processing rate in
excess of 500,000 sequencing libraries per day, at a cost of ~$0.01 per library.
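As a back-of-envelope check on the figures quoted above, the short Python sketch below derives three quantities that the abstract does not state directly: the wall-clock time the survey would take at the quoted rate, its approximate total compute cost, and the per-vCPU throughput. The function name `survey_estimates` and the dictionary keys are hypothetical, chosen only for illustration. The sketch assumes the quoted rate is a floor ("in excess of"), the quoted cost is approximate ("~"), and the full cluster was in use throughout, so the results are rough estimates rather than reported measurements.

```python
# Figures quoted in the abstract. The rate is a lower bound and the
# per-library cost is approximate, so every derived value is an estimate.
LIBRARIES_SURVEYED = 1_140_000
LIBRARIES_PER_DAY = 500_000     # "in excess of": treated as a floor
COST_PER_LIBRARY_USD = 0.01     # "~$0.01": approximate
VCPUS = 22_500


def survey_estimates(libraries, per_day, cost_per_library, vcpus):
    """Derive rough time, cost, and per-vCPU throughput (illustrative only)."""
    return {
        # Upper bound on duration, because the quoted rate is a floor.
        "days_at_quoted_rate": libraries / per_day,
        # Approximate total spend for the whole survey.
        "total_cost_usd": libraries * cost_per_library,
        # Assumes all vCPUs were busy for the whole run.
        "libraries_per_vcpu_day": per_day / vcpus,
    }


if __name__ == "__main__":
    est = survey_estimates(
        LIBRARIES_SURVEYED, LIBRARIES_PER_DAY, COST_PER_LIBRARY_USD, VCPUS
    )
    for name, value in est.items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the survey corresponds to a little over two days of cluster time and a total cost on the order of $11,400, or roughly 22 libraries per vCPU per day.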
We are building a 100% open dataset of all viral sequences in the SRA to accelerate
the translation of these data. All notebooks, source code, and raw and processed
sequence data generated in Serratus are freely available within 24 h of discovery.
Expanding the known repertoire of coronaviruses will not only help determine
the origins of this pandemic but can also help prevent another one.