Registration closed July 15.

BCC2020 is online, global, and affordable. The meeting and training are now done, and the CoFest is under way.

The 2020 Bioinformatics Community Conference brings together the Bioinformatics Open Source Conference (BOSC) and the Galaxy Community Conference into a single event featuring training, a meeting, and a CollaborationFest. Events run from July 17 through July 25 and are held in both the eastern and western hemispheres.

Tuesday, July 21 • 13:05 - 13:10
COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource 🍐


The presenter(s) will be available for live Q&A in this session (BCC West).

Andrea Guarracino 5, Peter Amstutz 2, Thomas Liener 3, Michael Crusoe 4, Adam Novak 6, Erik Garrison 6, Tazro Ohta 7, Bonface Munyoki 1, Danielle Welter 8, Sarah Zaranek 2, Alexander (Sasha) Wait Zaranek 2, Pjotr Prins 1

1 Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science
Center, Memphis, TN, USA.
2 Curii Corporation, Boston, MA, USA.
3 independent.
4 Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam, The
Netherlands.
5 Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata,
Rome, Italy.
6 UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
7 Database Center for Life Sciences, Tokyo, Japan.
8 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg.

Project Website: http://covid19.genenetwork.org/
Source Code: https://github.com/arvados/bh20-seq-resource
License: Apache 2.0

As part of the COVID-19 Virtual Biohackathon 2020 we formed a working group to create a
COVID-19 Public Sequence Resource (COVID-19 PubSeq) for SARS-CoV-2 virus sequences. Our goal
was to create a repository that had a low barrier to entry for uploading and analyzing sequence
data. We followed FAIR data practices: data are published under a public domain (CC0) or Creative
Commons Attribution 4.0 (CC-BY-4.0) license, structured metadata is validated against standard ontologies,
and, most importantly, reproducible workflows are executed after the upload in order to provide
up-to-date results rapidly and in standardized data formats.
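
As a rough illustration of the metadata checks described above, the following Python sketch validates a single record at upload time. It is not the project's actual validator: the field names (sample_id, license, host_species, collection_location), the license identifiers, and the rule that ontology terms must be http(s) URIs are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Assumed license identifiers; the abstract names only CC0 and CC-BY-4.0.
ALLOWED_LICENSES = {"CC0", "CC-BY-4.0"}

# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = ("sample_id", "license", "host_species", "collection_location")
ONTOLOGY_FIELDS = ("host_species", "collection_location")


def is_uri(value):
    """Return True if value looks like an absolute http(s) URI."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_metadata(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")
    # The license must be one of the open licenses accepted by the resource.
    license_id = record.get("license")
    if license_id and license_id not in ALLOWED_LICENSES:
        problems.append(f"unsupported license: {license_id}")
    # Ontology-backed fields must hold term URIs, not free text.
    for field in ONTOLOGY_FIELDS:
        value = record.get(field)
        if value and not is_uri(value):
            problems.append(f"{field} is not an ontology term URI: {value}")
    return problems
```

A record with a free-text host such as "human" would be rejected with a message naming the offending field, whereas a record using a term URI such as http://purl.obolibrary.org/obo/NCBITaxon_9606 would pass that check. Returning a list of problems rather than raising on the first one lets an uploader fix everything in a single pass.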

Existing data repositories for viral data include GISAID, EBI ENA and NCBI. These repositories allow
free sharing of data, but do not enforce strict quality control on submitted data or metadata, and
do not add value by running additional analyses. In addition, some databases have a
restrictive license that prevents data from being used in online web services and on-the-fly
computation, hindering research.

We created a prototype sequence resource within one week by leveraging existing technologies,
such as the Arvados Cloud platform (http://arvados.org), Common Workflow Language (CWL)
(http://commonwl.org), and the many free and open source software packages that are available
for bioinformatics. Pipelines developed by several teams were combined into an omnibus
pangenome analysis workflow. Computing resources for this project were generously donated by
Amazon Web Services.

Andrea Guarracino

PhD student, Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata, Rome, Italy.

Tuesday July 21, 2020 13:05 - 13:10 EDT