BCC2020 has ended
➞ Set your timezone before doing anything else on this site (home page, on the right)
➞ Limit what is shown by Type, Category, or Hemisphere
➞ Registration closed July 15.

BCC2020 is online, global, and affordable. The meeting and training are now done, and the CoFest is under way.

The 2020 Bioinformatics Community Conference brings together the Bioinformatics Open Source Conference (BOSC) and the Galaxy Community Conference into a single event featuring training, a meeting, and a CollaborationFest. Events run from July 17 through July 25 and are held in both the eastern and western hemispheres.

Meeting-West
Sunday, July 19
 

10:00 EDT

BCC2020 Conference Day 1: West
Keynotes, accepted talks, posters, demos, and networking in the West.

Sunday July 19, 2020 10:00 - 15:00 EDT
Joint

10:01 EDT

Welcome
Welcome to the 2020 Bioinformatics Community Conference (BCC2020)!

We'll introduce the conference, talk about the logistics of this online event, and present last minute news. This session will also include a tribute to James Taylor, one of the founders and PIs of the Galaxy Project who had a huge impact on open source and open science.

We will also hold a short icebreaker or two.

Moderators

Dave Clements

Training and Outreach Coordinator, Galaxy Project, Johns Hopkins University

Nomi Harris

BOSC Chair, LBNL
This is my 10th year chairing or co-chairing BOSC, the Bioinformatics Open Source Conference. In 2020, BOSC is part of the online Bioinformatics Community Conference, BCC2020.

Sunday July 19, 2020 10:01 - 10:30 EDT
Joint

10:30 EDT

West Keynote 1: How Open Source has Changed the World
Lincoln Stein, Ontario Institute for Cancer Research

This keynote will be presented live.

Abstract

During the week of March 16, 2020, the Ontario universities of Waterloo, Toronto, and McMaster closed their campuses due to the COVID-19 outbreak. Just a few days later, a small group of students who suddenly found themselves with lots of free time mounted a web site called flatten.ca to collect self-reported symptoms from individuals with COVID-19 and to display the distribution of cases across the country. On the first day it opened, flatten.ca had about 300 visitors. Within two weeks this number had swelled to 337,000 and continues to grow. The system is now used by public health authorities across the country, has been adopted by the City of Montreal as its official COVID-19 tracking system, and has spawned similar sites in locales as far away as Somalia. The students did not need to write a research grant proposal, apply to a health data registry for access, seek REB approval, or obtain software licenses. They perceived an urgent need, applied open source tools and methodologies, and built a fully functional system in record time, well ahead of the "professionals" in academia and industry.

This is the world that the pioneers of Open Source envisioned. One in which a passionate community of individuals can turn an idea into reality with a few keystrokes by building on top of a large set of unencumbered high quality tools, techniques and datasets.

However, it doesn't always go this way. In biomedical research we continue to be encumbered by antiquated protocols for accessing health data, stymied by published descriptions of computational protocols that are faulty or incomplete, impeded by the logistics of moving large data sets around, and blindered by restrictive data usage conditions that discourage the creative integration of diverse datasets. In this talk, I will look back over the progress we have made, and then look forward to the new paradigms for code and data sharing that promise to make success stories like flatten.ca the rule rather than the exception.


This keynote will be introduced by Nomi Harris.

Speakers

Lincoln Stein

OICR
Lincoln Stein focuses on supporting biomedical research both in Ontario and around the world by making large and complex biological datasets findable, accessible and usable. Prior to joining OICR in 2006, Dr. Stein played an integral role in many large-scale data initiatives at Co...


Sunday July 19, 2020 10:30 - 11:15 EDT
Joint

11:30 EDT

BOSC West Session 1a: Sequencing & analysis 🍐
The first talk session of BCC2020 is split into multiple tracks. This track will include talks submitted to the BOSC track.

Moderators

Chris Fields

Director, HPCBio, University of Illinois Urbana-Champaign
I am a reformed molecular microbiologist associatively directing a moderately sized group of very smart people from crazy diverse backgrounds, and we all work on anything and everything sequence-related.

Sunday July 19, 2020 11:30 - 12:20 EDT
BOSC
  Meeting-West

11:30 EDT

Galaxy West Session 1: Applications and use cases 🌀
The first talk session of BCC2020 is split into multiple tracks. This track will include talks submitted to the Galaxy track.

Moderators

Delphine Lariviere

Penn State University
Post-doc in the Galaxy Team (Nekrutenko Lab). Works on bacterial genomics, assembly, RNA Seq, TnSeq. Also interested in evolution, metagenomics, epigenetics and visualisation.

Sunday July 19, 2020 11:30 - 12:45 EDT
Galaxy
  Meeting-West

11:31 EDT

Cooperative bacteriophage genome annotation in the biologist-friendly Galaxy and Apollo platforms 🌀
➞ Abstract

Jolene Ramsey 1,2, Cory Maughmer 1,2, Anthony Criscione 1,2, Mei Liu 1,2, Ry Young 1,2, Jason J. Gill 1,3

  1. Center for Phage Technology, Texas A&M University, College Station, Texas, USA
  2. Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas, USA
  3. Department of Animal Science, Texas A&M University, College Station, Texas, USA

The presenter(s) will be available for live Q&A at the end of this session (BCC West).
In the modern genomic era, scientists without extensive bioinformatic training need to apply advanced computational analyses to genome annotation. At the Center for Phage Technology (CPT), we use two open source, web-based platforms: Galaxy, for reproducible computational analyses, and Apollo, a collaborative genome annotation editor, to facilitate annotation of phage genomes. The development and expansion of the Galaxy-Apollo bridge has been discussed at prior Galaxy Community Conferences, and the critical contributions by many former and current community members are gratefully acknowledged. In this presentation, we will describe how scientists and students have been trained to use semi-automated workflows in Galaxy and Apollo for collaborative annotation of genomes, including feature calling, contextualized functional prediction, and comparative genomics.
Unlike the genomes of most cellular life forms, phage genomes are usually a single contiguous molecule <200,000 bases in length. Their size allows high standards for complete, evidence-based annotations, and is amenable to genomics education settings. The CPT Galaxy and Apollo system is used for original biological research and development of new bioinformatic tools to analyze many individual phage genomes, as well as clusters of related phages. Our robust suite of phage-oriented tools includes open source applications such as PhageTerm, as well as unique programs for finding Shine-Dalgarno sequences, a collection of tools used for confident identification of lysis genes, and identification of interrupted genes that contain frameshifts or introns. The step-wise process moves all aspects of control and choice into the user's court. In comparison to widely used automated and fast command-line annotation methods, our integrated and flexible approach benefits from trained human intervention to result in high-quality final annotations.
The CPT has educated a steady stream of scientists, as well as both undergraduate and graduate students, informally and through formal university course offerings on using this Galaxy-Apollo infrastructure to annotate phage genomes. The resulting data, continuously collated on our BioProject page (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA222858), is deposited in public sequence repositories and published regularly. By creating a free user account, local or international teams can begin their own analyses. Accompanying user training material in the Galaxy Training Network format is hosted at https://cpt.tamu.edu/training-material/.
Project Website: https://cpt.tamu.edu/galaxy-pub

Speakers

Jolene Ramsey

Postdoc, Texas A&M University
I love to study the viruses of bacteria, called bacteriophages, or phages. Ask me about viruses, or my favorite podcast, This Week in Virology.



Sunday July 19, 2020 11:31 - 11:45 EDT
Galaxy

11:31 EDT

Digital Expression Explorer 2: a repository of 8 trillion uniformly processed RNA-seq reads and still counting 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC East).

Mark Ziemann 1, Antony Kaspi 2

1 Deakin University, Geelong, Australia. Email: m.ziemann@deakin.edu.au
2 The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.

Project Website: http://dee2.io/
Source Code: https://github.com/markziemann/dee2
License: (example: GNU General Public License v3.0)

RNA-seq is currently the most popular method for transcriptome-wide gene expression profiling, but despite data-sharing requirements, rates of data reuse are still very low. This is due to the need for high end computing infrastructure and pipelines that require command line expertise for raw data processing. Resources such as Recount2, ARCHS4 and Digital Expression Explorer 2 (DEE2) provide easy access to some uniformly processed data, with queryable web interfaces, bulk downloads and R packages.

Keeping up with the rapid pace of data deposition to the Sequence Read Archive (SRA) is proving a challenge. As of May 2020, there are 1.49M samples available in SRA for the nine organisms included in DEE2, and of these 0.88M are available as processed data in DEE2 (Figure 1). This makes DEE2 coverage about twice as extensive as the next largest dataset (ARCHS4). Since original publication in 2019, DEE2 has grown from 5.3 to 8.05 T mapped reads.

In this presentation I will outline the challenges and strategies in maintaining and growing resources of this scale. In addition, we will discuss recent enhancements, including direct integration of the web interface with Degust (http://degust.erc.monash.edu/), a popular web-based tool for statistical analysis of RNA-seq data. The R package getDEE2 has been extensively updated and submitted to Bioconductor. It allows programmatic access to DEE2 datasets in the form of SummarizedExperiment objects that are compatible with many downstream analysis tools in the Bioconductor ecosystem. Together these advances are helping DEE2 to achieve the goal of making all RNA-seq data freely available to everyone.


Speakers

Mark Ziemann

Deakin University
Hi there 👋 I am a Lecturer and researcher in computational biology at Deakin University, Australia. Our group is focused on building data resources and software tools to accelerate biomedical discovery. We collaborate closely with clinicians and biologists to get the most out...



Sunday July 19, 2020 11:31 - 11:45 EDT
BOSC

11:45 EDT

Community genome annotation integrates with Galaxy via Apollo providing greater integration and more functional annotation options 🌀
➞ Abstract

Nathan Dunn 1, Helena Rasche 2, Anthony Bretaudeau 3, Ian Holmes 4

  1. Lawrence Berkeley National Lab, Berkeley, CA
  2. University of Freiburg, Freiburg, Germany
  3. French National Institute for Agriculture, Food, and Environment (INRAE), Rennes, France
  4. University of California Berkeley, Berkeley, CA

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers

Nathan Dunn

Software Developer, Lawrence Berkeley National Lab



Sunday July 19, 2020 11:45 - 11:50 EDT
Galaxy

11:45 EDT

Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Sam Kovaka 1, Yunfan Fan 2, Bohan Ni 1, Winston Timp 2, Michael C. Schatz 1,3,4
Email: skovaka1@jhu.edu

1 Department of Computer Science, Johns Hopkins University, Baltimore, MD.
2 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD.
3 Department of Biology, Johns Hopkins University, Baltimore, MD.
4 Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.

Project Source Code: https://github.com/skovaka/UNCALLED
License: MIT License

ReadUntil sequencing allows nanopore devices to selectively stop sequencing an individual read in real-time by ejecting it from the pore and immediately switch to another read. If reads could be rapidly mapped to large references while being sequenced, this would enable targeted sequencing of specific genomic regions or even specific genomes. However, most mapping methods require basecalling, which is computationally intensive and requires a significant amount of the read to be sequenced.

Here we present UNCALLED (Utility for Nanopore Current ALignment to Large Expanses of DNA), an open-source mapper that rapidly matches raw streaming nanopore current signals to a large DNA reference without basecalling. This is accomplished by probabilistically considering all possible k-mers that the signal could represent, and then pruning the possibilities based on the reference genome sequence encoded using an FM-index. Importantly, UNCALLED dynamically adjusts the signal level model probability cutoffs during alignment to achieve both high accuracy and high speed when aligning the noisy signal data.
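
To illustrate the pruning idea in the paragraph above, here is a minimal sketch of FM-index backward search in Python. It is not UNCALLED's implementation: the function names, the toy reference and the candidate k-mers are all hypothetical, the suffix array is built naively, and the signal probability model is left out entirely. It only shows how candidate k-mers that do not occur in the reference can be discarded.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence counts) for text."""
    text += "$"  # terminator, lexicographically smaller than A/C/G/T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact matches of pattern using backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:  # interval is empty: this candidate can be pruned
            return 0
    return hi - lo


# Hypothetical reference and candidate k-mers, for illustration only.
reference = "ACGTACGTTT"
index = build_fm_index(reference)
candidates = ["ACG", "TT", "GGG"]
surviving = [k for k in candidates if count_occurrences(index, k) > 0]
print(surviving)
```

Each backward-search step narrows the interval of matching suffixes, so a candidate is dropped as soon as the interval becomes empty rather than after the whole k-mer has been compared.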

We used UNCALLED to deplete the sequencing of known bacterial genomes within a Zymo mock microbial community, enriching the remaining yeast sequence from ~20x coverage to ~100x. We also used UNCALLED to enrich for 148 human genes associated with hereditary cancers to 29.6x coverage (a 5.6 fold increase) using a single MinION flowcell, enabling accurate detection of SNPs, indels, structural variants (SVs), and methylation in these genes. Notably, twice as many SVs were detected compared to 50x coverage Illumina sequencing, verified by whole-genome nanopore and PacBio HiFi sequencing. Finally, we show that UNCALLED could be used to enrich larger gene panels such as all 717 genes in the COSMIC Census, or be used with cDNA/RNA sequencing, for example to deplete high-abundance transcripts.



Speakers

Sam Kovaka

Johns Hopkins University



Sunday July 19, 2020 11:45 - 11:50 EDT
BOSC

11:50 EDT

THAPBI PICT -- a metabarcoding analysis pipeline developed as a Phytophthora ITS1 Classification Tool 🍐
→ Abstract, Slides, Video

The presenter(s) will be available for live Q&A in this session (BCC West).

Peter Cock 1, David Cooke 2, Leighton Pritchard 3

1 Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, UK
2 Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee, UK
3 Strathclyde Institute of Pharmacy & Biomedical Sciences, Glasgow, UK

Repository: https://github.com/peterjc/thapbi-pict/
Documentation: https://thapbi-pict.readthedocs.io/
License: MIT

Molecular barcodes are central to environmental monitoring and identification of species present in a sample, and use PCR primers to amplify a diagnostic genome region of the organisms of interest. We are interested in metabarcoding where multiple samples are multiplexed for high-throughput sequencing on the Illumina platform, using overlapping paired-end reads. Each sample yields a collection of marker sequences, and matching these to a database of known species produces a taxonomic breakdown reflecting community composition.

THAPBI PICT is a metabarcoding tool we developed for the UK-funded Tree Health and Plant Biosecurity Initiative (THAPBI) Phyto-Threats project, which focused on identifying Phytophthora species in commercial tree nurseries. Phytophthora (from Greek meaning plant-destroyer) are economically important plant pathogens in both agriculture and forestry. This project targeted an ITS1 marker (Internal Transcribed Spacer one, a region found in eukaryotic genomes between the 18S and 5.8S rRNA genes) with nested primers to identify Phytophthora species. By varying primer settings and using a custom database, THAPBI PICT can be applied to other organisms and/or barcode marker sequences, making it more than just a Phytophthora ITS1 Classification Tool (PICT).

The analysis pipeline starts from demultiplexed paired FASTQ files, as produced by the Illumina MiSeq platform. These are quality trimmed, overlapping reads are merged and primer trimmed (calling external tools), and the results are then deduplicated, giving a much smaller list of unique sequences and associated read counts (passing a minimum count threshold intended to exclude "noise"). These are matched to a curated database using a range of methods, producing both plain text and formatted Excel output. An edit graph in XGMML format is also produced for display in Cytoscape and other visualisation tools.
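
The deduplication step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not THAPBI PICT's code: the function name, the `min_count` parameter and its value, and the example reads are all assumptions made for the demonstration.

```python
from collections import Counter


def unique_sequences(reads, min_count=2):
    """Collapse identical sequences and drop those seen fewer than min_count times."""
    counts = Counter(seq.upper() for seq in reads)
    kept = {seq: n for seq, n in counts.items() if n >= min_count}
    # Most abundant first; ties broken alphabetically for stable output.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical merged, primer-trimmed reads from one sample.
merged_reads = ["ACGT", "ACGT", "acgt", "ACGA", "TTTT", "TTTT"]
for seq, n in unique_sequences(merged_reads, min_count=2):
    print(seq, n)
```

Sequences below the threshold (here the single `ACGA` read) are treated as noise and never reach the database-matching stage, which keeps the list of candidates short.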

THAPBI PICT is released as open source software under the MIT licence. It is written in Python, a free open source language available on all major operating systems. Version control using git, hosted publicly on GitHub, is used for the source code, documentation, and database builds, including tracking the hand-curated reference set of Phytophthora ITS1 sequences. Continuous integration of the test suite is currently run on both TravisCI and CircleCI. Software is released to the Python Package Index (PyPI) as standard for the Python ecosystem, and additionally packaged for Conda via the Bioconda channel. This offers simple installation of the tool itself, and all the command line dependencies on Linux or macOS. The documentation is currently hosted on Read The Docs, updated automatically from the GitHub repository.


Speakers

Peter Cock

The James Hutton Institute
Bioinformatician at the James Hutton Institute, a member of the BOSC organizing committee, treasurer of the Open Bioinformatics Foundation, and a core developer on the Biopython project.



Sunday July 19, 2020 11:50 - 11:55 EDT
BOSC

11:50 EDT

Computational chemistry analysis using Galaxy: Exploring antigen-antibody binding patterns for MUC1-AR20.5 🌀
➞ Abstract

Christopher Barnett 1, Tharindu Senapathi 1, Sean Collins 2, Kyllen Dilsook 2, Natalie Terry 2

  1. Scientific Computing Research Unit and Department of Chemistry, University of Cape Town, Rondebosch, 7701, South Africa. Email: chris.barnett@uct.ac.za
  2. Department of Chemistry, University of Cape Town, Rondebosch, 7701, South Africa

The presenter(s) will be available for live Q&A at the end of this session in both BCC West and BCC East.

Speakers

Chris Barnett

Lecturer, University of Cape Town



Sunday July 19, 2020 11:50 - 12:05 EDT
Galaxy

11:55 EDT

Please contribute to FASTQE so I don't have to 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC East).

Andrew Lonsdale 1,2

1 Peter MacCallum Cancer Centre, Melbourne, Victoria 3000, Australia. Email: andrew.lonsdale@petermac.org
2 Sir Peter MacCallum Department of Oncology, The University of Melbourne, Victoria 3010, Australia

Project Website: http://fastqe.com
Source Code: https://github.com/lonsbio/fastqe
License: MIT License

FASTQE is a utility for viewing the quality of biological sequence data as emoji. It takes FASTQ input, summarises the average quality score per base position, and transcribes each ASCII-encoded Phred summary score into a corresponding emoji to see the good, the bad, and the ugly of sequencing data.

Initially just a proof of concept at the end of a 2016 PyConAU talk, it has gradually evolved into a Python package that is also available as a command line program. It can be installed both via PyPI and Bioconda. When invoked from the command line it can also display the minimum and maximum quality scores per position, and bin quality scores into a reduced set of emoji.
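
The per-position summary and binning described above can be sketched as follows. This is a hedged illustration, not FASTQE's code: the emoji bins and thresholds are hypothetical choices, the function names are invented for this example, and the parser assumes simple four-line FASTQ records with a Phred+33 offset.

```python
from statistics import mean

# Hypothetical bins chosen for illustration; NOT FASTQE's real mapping.
EMOJI_BINS = [(30, "😄"), (20, "😐"), (0, "💩")]


def parse_fastq_qualities(text, offset=33):
    """Yield one list of Phred scores per four-line FASTQ record."""
    lines = text.strip().splitlines()
    for i in range(3, len(lines), 4):
        yield [ord(ch) - offset for ch in lines[i]]


def mean_quality_per_position(reads):
    """Average Phred score at each base position (reads may differ in length)."""
    reads = list(reads)
    longest = max(len(r) for r in reads)
    return [mean(r[i] for r in reads if i < len(r)) for i in range(longest)]


def to_emoji(score):
    """Map a score to the first bin whose threshold it reaches."""
    for threshold, emoji in EMOJI_BINS:
        if score >= threshold:
            return emoji
    return EMOJI_BINS[-1][1]


def summarise(text):
    """Return one emoji per base position for the given FASTQ text."""
    means = mean_quality_per_position(parse_fastq_qualities(text))
    return "".join(to_emoji(q) for q in means)


# Two hypothetical reads; 'I' encodes Phred 40 and '#' encodes Phred 2.
FASTQ = """@read1
ACGT
+
IIA#
@read2
ACGT
+
II5#
"""
print(summarise(FASTQ))
```

Binning to a small emoji set trades resolution for readability, which suits the teaching and outreach uses described in the abstract.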

Despite little promotion beyond social media (@fastqe), it has gained some popularity. FASTQE has been used for an undergraduate command line workshop [1], presentations, and workshops. Surprisingly, there have even been serious uses of the tool. FASTQE helped show that artefacts in single-cell RNA-seq data can increase the burden of error correction in cell barcodes, and it revealed at least one case of a software bug that can lead to incorrect barcode correction.

Despite these compelling use cases, FASTQE has a bus-factor of 1. In order to provide a more valuable tool for bioinformatics training, education and outreach, contributions are needed. This presentation will demonstrate the functionality of FASTQE, outline the current status of the project and a roadmap for enhancements, and call for more contributions to this open source project. Everyone knows this is a silly idea. This talk will persuade future contributors that maybe it isn't as silly as it sounds.

[1] Rachael St. Jacques, Max Maza, Sabrina Robertson, Guoqing Lu, Andrew Lonsdale, Ray A Enke (2019). A Fun
Introductory Command Line Exercise: Next Generation Sequencing Quality Analysis with Emoji!. NIBLSE
Incubator: Intro to Command Line Coding Genomics Analysis, (Version 2.0). QUBES Educational Resources.
doi:10.25334/Q4D172

Speakers

Andrew Lonsdale

Peter MacCallum Cancer Centre, Melbourne, Victoria 3000, Australia



Sunday July 19, 2020 11:55 - 12:00 EDT
BOSC
  Meeting-West

12:00 EDT

A reproducible workflow for amplicon-based microbial community analysis using the drake R package 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Rodrigo Ortega-Polo 1, Shefali Vishwakarma 2,3, Lan Tran 4, Amanda Gregoris 4, Marta Guarna 4

1 Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada; Lethbridge, Alberta,
Canada. Email: rodrigo.ortegapolo@canada.ca
2 Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada; Lethbridge, Alberta,
Canada.
3 Department of Molecular Biology and Biochemistry, Simon Fraser University; Surrey, British Columbia,
Canada.
4 Beaverlodge Research Farm, Agriculture and Agri-Food Canada; Beaverlodge, Alberta, Canada.

Project Website: https://github.com/BeeCSI-Microbiome/dada2_drake_workflow
Source Code: https://github.com/BeeCSI-Microbiome/dada2_drake_workflow
License: MIT License

The use of workflow management systems promotes best practices in computational biology such as reproducibility, provenance tracking and documentation of steps and parameters used in analyses. Furthermore, the ability to restart workflows from a given point in the analysis instead of starting over provides an efficient way of developing data analysis pipelines. The drake R package is a framework for workflow management that allows users to design workflows and visualize their status in a reproducible and scalable manner (Figure 1). In our work, we used drake to design a pipeline for amplicon-based microbial community data using DADA2 for denoising and taxonomic classification, and phyloseq and other R packages for visualization and data tidying. We implemented this workflow for the analysis of 16S rRNA microbial community datasets from the honey bee gut microbiome. This workflow has the advantage of enabling users to evaluate microbial communities with amplicon sequencing data while working entirely within R.

Speakers

Rodrigo Ortega-Polo

Agriculture and Agri-Food Canada



Sunday July 19, 2020 12:00 - 12:05 EDT
BOSC

12:05 EDT

Q & A 🌀
Question and answer session for the just-finished talks.

Moderators

Delphine Lariviere

Penn State University
Post-doc in the Galaxy Team (Nekrutenko Lab). Works on bacterial genomics, assembly, RNA Seq, TnSeq. Also interested in evolution, metagenomics, epigenetics and visualisation.

Sunday July 19, 2020 12:05 - 12:10 EDT
Galaxy

12:05 EDT

SigBio-Shiny: A standalone interactive application for detecting biological significance on a set of genes 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC East).

Sangram Keshari Sahu

Independent Researcher, Bangalore, India.

Email: sangramsahu15@gmail.com
Project Website: https://github.com/sk-sahu/sig-bio-shiny
Source Code: https://github.com/sk-sahu/sig-bio-shiny
Licence: MIT Licence

Detecting biological significance is an essential step for any high-throughput sequence analysis. Once sequence reads are mapped and assembled, this is followed by different quantification analyses, which end up with a set of features (transcripts/genes). Quickly exploring those features together from different angles, along with statistical inference, gives a good idea of the biology they are involved in.

Doing this kind of exploration for a particular organism requires an up-to-date annotation database. Currently available online/API platforms support either very few organisms or only model organisms. Apart from that, reproducibility is a primary issue, as databases are continually updated.

To tackle these problems I am presenting SigBio-Shiny, a standalone interactive application based on R-Shiny which supports more than just model organisms, with no requirement for manual database maintenance. It leverages available open-source resources such as Bioconductor's AnnotationHub to collect the organism's updated database in real time while keeping track of which version of the database was used. On top of this database, it helps with detecting biological significance in a set of genes by doing gene mapping, enrichment analysis of Gene Ontology (GO) terms, and pathway analysis.

Keywords: Interactive application, Significant biology, Non-model Organism, Annotation database, Gene mapping, Gene Ontology (GO), Pathway, Enrichment analysis

Speakers

Sangram Keshari Sahu

Genomics Data Scientist



Sunday July 19, 2020 12:05 - 12:10 EDT
BOSC
  Meeting-West

12:10 EDT

Streamlining accessibility and computability of large-scale genomic datasets with the NHGRI genome data science Analysis, Visualization, and Informatics Lab-Space (AnVIL) 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Michael C. Schatz 1, Anthony Philippakis 2, on behalf of the AnVIL project team 3
                                                
1 Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD. Email: mschatz@cs.jhu.edu
2 Broad Institute of MIT and Harvard, Cambridge, MA
3 City University of New York, Harvard, Oregon Health & Sciences University, Penn State, Roswell Park Cancer Institute, University of California Santa Cruz, University of Chicago, Vanderbilt, Washington University.


Project Website: https://anvilproject.org/ 
Source Code: https://github.com/anvilproject 
License: MIT License


The traditional model of genomic data sharing (centralized data warehouses such as dbGaP from which researchers download data to analyze locally) is increasingly unsustainable. Not only are transfer/download costs prohibitive, but this approach also leads to redundant siloed compute infrastructure and makes ensuring security and compliance of protected data highly problematic.
The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space, or AnVIL, inverts this model, providing a cloud environment for the analysis of large genomic and related datasets. By providing a unified environment for data management and compute, AnVIL eliminates the need for data movement, allows for active threat detection and monitoring, and provides elastic, shared computing resources that can be acquired by researchers as needed. AnVIL provides access to key NHGRI datasets, such as the CCDG (Centers for Common Disease Genomics), CMG (Centers for Mendelian Genomics), eMERGE (Electronic Medical Records and Genomics), as well as other relevant datasets.
The platform is built on a set of established components that have been used in a number of flagship scientific projects. The Terra platform provides a compute environment with secure data and analysis sharing capabilities. Dockstore provides standards-based sharing of containerized tools and workflows. Bioconductor and Galaxy provide environments for users at different skill levels to construct and execute analyses. The Gen3 data commons framework provides data and metadata ingest, querying, and organization.
AnVIL provides a collaborative environment for creating and sharing data and analysis workflows for both users with limited computational expertise and sophisticated data scientist users. It provides multiple entry points for data access and analysis, including execution of batch workflows written in WDL, notebook environments including Jupyter and RStudio, Bioconductor packages for building analysis on top of AnVIL APIs and services, and will offer Galaxy instances for interactive analysis. It will be possible to integrate additional analysis environments through standard APIs.
Long-term, the AnVIL will provide a unified platform for ingestion and organization for a multitude of current and future genomic and genome-related datasets. Importantly, it will ease the process of acquiring access to protected datasets for investigators and drastically reduce the burden of performing large-scale integrated analyses across many datasets to fully realize the potential of ongoing data production efforts.
                                   
    

Speakers

Michael Schatz

Johns Hopkins University



Sunday July 19, 2020 12:10 - 12:15 EDT
BOSC

12:10 EDT

Integrating and analyzing genotype, phenotype, and environmental data through CartograTree and Tripal Galaxy 🌀
➞ Abstract

Irene Cobo-Simón 1, Nic Herndon 2, Margaret Staton 3, Emily Grau 4, Sean Buehler 4, Peter Richter 4, Risharde Ramnath 4, Charlie Demurjian 4, Abdullah Almsaeed 3, Jill Wegrzyn 4

  1. Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA
  2. Department of Computer Science, East Carolina University, NC, USA
  3. Department of Entomology and Plant Pathology, University of Tennessee, Knoxville, TN, USA
  4. Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers

Irene Cobo

Postdoctoral Scholar, Department of Ecology and Evolutionary Biology, University of Connecticut
My research interest is mainly focused on evolutionary biology from a molecular perspective. In particular, I am interested in studying the genomic basis of adaptation and biodiversity.



Sunday July 19, 2020 12:10 - 12:25 EDT
Galaxy

12:15 EDT

A comprehensive benchmarking of WGS-based structural variant callers 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC East).

Varuni Sarwal 1,2, Sebastian Niehus 3,4, Ram Ayyala 1, Serghei Mangul 5

1 University of California, Los Angeles, CA 90095, USA. Email: sarwal8@gmail.com
2 Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
3 Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany
4 Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin,
Humboldt-Universität zu Berlin, and Berlin Institute of Health, Charitéplatz 1, 10117 Berlin, Germany
5 University of Southern California, Los Angeles, CA 90089, USA

Project Website: https://github.com/Mangul-Lab-USC/benchmarking_SV_publication
Source Code: https://github.com/Mangul-Lab-USC/benchmarking_SV_publication
License: MIT License

Structural variants (SVs) are genomic regions that contain an altered DNA sequence due to
deletion, duplication, insertion, or inversion, and they vary in pathogenicity.
Dissecting SVs from whole genome sequencing (WGS) data presents a number of challenges
and a plethora of SV-detection methods have been developed. Currently, there is a paucity of
evidence which investigators can use to select appropriate SV-detection tools. We evaluated the
performance of 15 SV-detection tools based on their ability to detect deletions from aligned
WGS reads using a comprehensive PCR-confirmed gold standard set of SVs to find methods
with a good balance between sensitivity and precision. While the number of true deletions is
3710, the number of deletions detected by the tools ranged from 899 to 82,225. 53% of the
methods reported fewer deletions than are known to be present in the sample. The length
distribution of detected deletions varied across tools and was substantially different from the
distribution of true deletions. 53% of tools underestimated the true size of SVs, and deletions
detected by BreakDancer were the closest to the true median deletion length. We allowed the
coordinates of detected deletions to deviate from the coordinates of the true deletions by
thresholds ranging from 0 to 10,000 bp. Manta achieved the highest F-score for all thresholds.
Methods with high specificity rates tend to also have significantly higher F-score and precision
rates. CLEVER was able to achieve the highest sensitivity while the most precise method was
PopDel. We assessed the performance of SV callers at coverages from 32x to 0.1x generated by
down-sampling the original WGS data. DELLY showed the highest F-score for coverage below
4x while Manta was the best performing tool from 8x to 32x. We assessed the effect of deletion
length on the accuracy of detection. Manta and CREST were the only tools with high specificity
for deletions shorter than 500bp. LUMPY was the only method able to deliver an F-score above
30% across all categories. Manta and LUMPY were the best performing tools for general
applications. Our recommendations can help researchers choose the best SV detection software,
as well as inform the developer community of the challenges of SV detection.
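
The threshold-based comparison described in this abstract can be illustrated with a small sketch. It is not the authors' benchmarking code: the `Deletion` record, the rule that both breakpoints must fall within the threshold, and the toy coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deletion:
    chrom: str
    start: int
    end: int


def matches(call, truth, threshold):
    """Assumed rule: a call matches if both breakpoints lie within `threshold` bp."""
    return (
        call.chrom == truth.chrom
        and abs(call.start - truth.start) <= threshold
        and abs(call.end - truth.end) <= threshold
    )


def evaluate(calls, truths, threshold):
    """Return (sensitivity, precision, f_score) for one tool at one threshold."""
    unmatched = list(truths)
    true_positives = 0
    for call in calls:
        for truth in unmatched:
            if matches(call, truth, threshold):
                unmatched.remove(truth)  # each true deletion is counted once
                true_positives += 1
                break
    sensitivity = true_positives / len(truths) if truths else 0.0
    precision = true_positives / len(calls) if calls else 0.0
    total = sensitivity + precision
    f_score = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f_score


# Toy data, invented for this example.
truths = [Deletion("chr1", 1000, 2000), Deletion("chr1", 5000, 5600), Deletion("chr2", 300, 900)]
calls = [Deletion("chr1", 1010, 1990), Deletion("chr1", 5200, 5900), Deletion("chr3", 10, 500)]

for threshold in (0, 10, 100, 1000):
    print(threshold, evaluate(calls, truths, threshold))
```

Loosening the threshold can only add matches, so sensitivity and precision never decrease as the allowed deviation grows.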

Speakers

Varuni Sarwal

Undergraduate student, UC Los Angeles



Sunday July 19, 2020 12:15 - 12:20 EDT
BOSC
  Meeting-West

12:20 EDT

Q&A for session B1a 🍐
The presenter(s) will be available for live Q&A in this session.

Moderators

Chris Fields

Director, HPCBio, University of Illinois Urbana-Champaign
I am a reformed molecular microbiologist associatively directing a moderately sized group of very smart people from crazy diverse backgrounds, and we all work on anything and everything sequence-related.

Sunday July 19, 2020 12:20 - 12:25 EDT
BOSC

12:24 EDT

BOSC West Session 1b: Open data 🍐
The first talk session of BCC2020 is split into multiple tracks. This track will include talks submitted to the BOSC track.

Moderators

Chris Fields

Director, HPCBio, University of Illinois Urbana-Champaign
I am a reformed molecular microbiologist associatively directing a moderately sized group of very smart people from crazy diverse backgrounds, and we all work on anything and everything sequence-related.

Sunday July 19, 2020 12:24 - 12:39 EDT
BOSC
  Meeting-West

12:25 EDT

Tripal: an example of successful open-source distributed team development 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Margaret Staton 1*, Abdullah Almsaeed 1, Noah Caldwell 1, Ethalinda Cannon 2, Valentin Guignon 3,
Doreen Main 4, Monica Poelchau 5, Manuel Ruiz 3, Jill Wegrzyn 6, Bradford Condon 1, Stephen Ficklin 6,
Lacey Anne Sanderson 7

1. University of Tennessee, Knoxville, TN, USA. * Email: mstaton1@utk.edu
2. Iowa State University, Ames, Iowa, USA.
3. Bioversity International, Montpellier, France.
4. Washington State University, Pullman, WA, USA.
5. USDA-ARS National Agricultural Library, Beltsville, MD, USA.
6. University of Connecticut, Storrs, Connecticut, USA.
7. University of Saskatchewan, Saskatoon, Saskatchewan, Canada

Project Website: http://tripal.info/
Source Code: https://github.com/tripal
License: GNU General Public License v2.0

Tripal is an open-source software toolkit for building community-oriented biological databases
with a focus on genetic and genomic data. Beyond database structure and data access, it provides a
mechanism for data standardization and consistent implementation of FAIR principles across
communities. Currently, the Tripal software provides the foundation for over 30 databases
spanning animals, plants, insects, and more. Tripal has an active international developer
community working from academia, government agencies, and research institutes. Over the past
decade, the Tripal developer community has built a distributed team software development model
with over 30 developers from at least 10 different research groups and 3 countries. Two aspects of
Tripal have helped to make this a success. First, we have recently defined a community governance
structure with a project management committee and an internal advisory board. These function to
promote communication, provide a mechanism for shared decision making, and balance innovation
with sustainability. Second, Tripal's architecture consists of a core of common, centralized
functionality that can be easily expanded with shareable extension modules. This balances shared
community structure and reusable code with the need for individual research groups to customize
and develop quickly and independently. We have noted some disadvantages, but mostly
advantages, due to the unique community structure and software architecture.

Speakers

Margaret Staton

Assistant Professor, University of Tennessee, Knoxville
On the cyberinfrastructure side, I work on community genome databases (particularly Tripal software) and mobile apps for citizen science/outreach. I also do a lot with basic data analysis around genomes, transcriptomes, and epigenomes of plants.



Sunday July 19, 2020 12:25 - 12:30 EDT
BOSC

12:25 EDT

Automated real-time data analysis and visualizations for the SARS-CoV-2/Covid19 portal 🌀
➞ Abstract

Marius van den Beek 1, Dannon Baker 2, Anton Nekrutenko 1

  1. Department of Biochemistry and Molecular Biology, Penn State University, University Park PA, USA
  2. Department of Biology, Johns Hopkins University, Baltimore MD, USA

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers

Marius van den Beek

Penn State University



Sunday July 19, 2020 12:25 - 12:40 EDT
Galaxy

12:30 EDT

BioThings Explorer: A platform for distributed knowledge integration across biomedical APIs 🍐
Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Jiwen Xin 1, Sebastian Lelong 1, Xinghua Zhou 1, Marco Cano 1, Ginger Tsueng 1, Chunlei Wu 1, Andrew Su 1

1 Scripps Research, 10550 North Torrey Pines Road, La Jolla, CA 92037, kevinxin@scripps.edu

Project Website: https://biothings.io/explorer/ 
Source Code: https://github.com/biothings/biothings_explorer 
License: Apache License

BioThings Explorer (BTE) is a distributed biomedical data integration solution that enables complex queries to be constructed and executed by aligning and connecting disparate RESTful APIs. It facilitates exploring and querying the vast wealth of biomedical data that investigators continuously generate, giving users the opportunity to seek out logical relationships between bio-entities and discover hidden connections in biomedical data without the burden of building a centralized data warehouse.

BioThings Explorer leverages SmartAPI (https://smart-api.info), an API registry that extends the OpenAPI standard. SmartAPI records provide rich metadata on the types of associations (e.g. Disease (input) -> treated_by -> Gene (output)) an API is able to deliver, as well as how to retrieve each association (an example can be found at https://bit.ly/smartapi_opentarget). Together, these SmartAPI records form a metaknowledge graph (https://smartapi.info/registry/translator/meta-kg) that describes the compatibility of APIs based on shared input and output types. BioThings Explorer can then take advantage of the metaknowledge graph to automate the planning and execution of queries across the API network based on specific user requests.
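
As a rough illustration of how a metaknowledge graph can drive query planning, the sketch below chains APIs whose output type matches the next API's input type. It is not BTE's implementation: the records in `META_KG`, the API names, the predicates, and the `plan_queries` function are all hypothetical.

```python
from collections import deque

# Hypothetical SmartAPI-style records: (api_name, input_type, predicate, output_type).
META_KG = [
    ("example_disease_api", "Disease", "treated_by", "ChemicalSubstance"),
    ("example_disease_api", "Disease", "associated_with", "Gene"),
    ("example_gene_api", "Gene", "targeted_by", "ChemicalSubstance"),
    ("example_chem_api", "ChemicalSubstance", "metabolized_by", "Gene"),
]


def plan_queries(meta_kg, source_type, target_type, max_hops=2):
    """Breadth-first search for API chains that link source_type to target_type."""
    plans = []
    queue = deque([(source_type, [])])
    while queue:
        current_type, path = queue.popleft()
        if len(path) == max_hops:
            continue  # the hop limit also prevents endless cycles
        for edge in meta_kg:
            _, in_type, _, out_type = edge
            if in_type != current_type:
                continue
            new_path = path + [edge]
            if out_type == target_type:
                plans.append(new_path)
            else:
                queue.append((out_type, new_path))
    return plans


for plan in plan_queries(META_KG, "Disease", "ChemicalSubstance"):
    print(" -> ".join(f"{api}:{pred}" for api, _, pred, _ in plan))
```

Each plan is an ordered list of API calls; a real planner would then execute them in sequence, feeding the identifiers returned by one call into the next.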

Compared to traditional centralized data integration solutions, BTE offers several advantages. First, it can be easily extended by the community. Adding a new API into the distributed knowledge graph only requires the creation of a SmartAPI metadata record, not the addition of any new code to enforce standardized syntax. Because of its extensibility, over 27 APIs have already been integrated into BTE, covering 138 API operations and 14 semantic types. Second, querying source APIs on the fly guarantees that the data retrieved are always up-to-date with the source. Last, this approach is highly scalable, since the BTE client runs on each user's own computing infrastructure, so there is no centralized component that could become a single point of failure.

Through both the Python package and the web interface, BioThings Explorer can be used to answer two classes of queries: "PREDICT" and "EXPLAIN". EXPLAIN queries are designed to identify plausible reasoning chains that explain the relationship between two entities, for example, "Why does imatinib have an effect on the treatment of chronic myelogenous leukemia (CML)?" (try it live on Colab: https://bit.ly/bte_explain_colab). PREDICT queries are designed to predict plausible relationships between one entity and an entity class, for example, "What drugs might be used to treat hyperphenylalaninemia?" (try it live on Colab: https://bit.ly/bte_predict_colab).

Speakers

Jiwen Xin

Scripps Research
I'm a senior staff scientist in Scripps Research. I'm a Ph.D. in Biology and a self-taught computer engineer. I love combining my expertise in both Biology and Computer Science to build scalable and high performance open source applications to facilitate biomedical research.



Sunday July 19, 2020 12:30 - 12:35 EDT
BOSC

12:35 EDT

Don't worry about data management - use Cenzontle 🍐
→ Abstract

The presenter(s) will be available for live Q&A in this session (BCC West).

Asis Hallab 1, Verónica Suaste 2, Francisco Ramírez 2, Constantin Eiteneuer 1, Thomas Voecking 1, Alicia Mastretta-Yanes 2

1 Jülich Research Center, Germany. Email: asis.hallab@gmail.com
2 CONABIO, Mexico.

Project Website: https://sciencedb.github.io/ 
Source Code: https://github.com/ScienceDb
License: GPL-3

The need for a feature-complete, flexible management suite capable of handling big distributed data
In the life sciences, data is often diverse, interdisciplinary, and stored at different sites. The reproducibility crisis has long been recognized. In the US alone, an annual loss of 28 billion dollars has been attributed to research funding spent on projects that yielded irreproducible results (doi.org/10.1371/journal.pbio.1002165). Identified causes are diverse but regularly include insufficient data management. Data should be findable, accessible, interoperable, and reusable (FAIR), and a concise data management plan is key to receiving funding and publication. The problem is that creating a suitable data management platform is a considerable software engineering task in itself, more so for diverse big data, and even more so if several distributed data warehouses must be integrated. Efficient and reliable data management often has no ideal solution, because research groups need to do science, not data warehouse software engineering.

Solution: Have software build your data administration warehouse for you
We present Cenzontle, a set of automatic software generators that create your custom data warehouse for you. Define your data formats in standard JSON and get a fully functional warehouse with little or no coding effort. The warehouse comprises two interfaces. The first is a graphical, browser-based interface that follows Google's Material Design standards and thus has both a professional look and intuitive handling. No documentation is needed to use it. Custom visualizations with Plotly can be integrated to help scientists explore the data and form hypotheses. The second is a programmatic interface (API) that allows data scientists to build exhaustive queries, execute them efficiently, and thus feed data directly into their analysis pipelines from any programming language. A full-featured IDE helps with query building and has complete, searchable documentation. Standard “CRUD” access functions are offered for all data models. Data can be created, also en masse by uploading tables. It can be read, searched, sorted, and separated into bite-sized subsets. Records can, of course, be updated and deleted. Most importantly, different data stores can be incorporated: use any number of databases and servers you like. Relations between records, even on different servers, are supported. Security is provided through standard authentication and role-based authorization, verified on each standard access function.
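
To make the idea of generating a warehouse from a JSON definition concrete, here is a minimal sketch. It does not use Cenzontle's actual model format or generated code: the layout of `MODEL_JSON`, the type names, and the `GeneratedStore` class are assumptions, and an in-memory dictionary stands in for a real database.

```python
import json

# Hypothetical model definition; the keys and type names are illustrative only.
MODEL_JSON = """
{
  "model": "Sample",
  "attributes": {"name": "String", "depth_m": "Float", "count": "Int"}
}
"""

TYPE_MAP = {"String": str, "Float": float, "Int": int}


class GeneratedStore:
    """In-memory store whose CRUD functions are derived from a JSON model."""

    def __init__(self, definition):
        self.model = definition["model"]
        self.attributes = {k: TYPE_MAP[v] for k, v in definition["attributes"].items()}
        self.records = {}
        self.next_id = 1

    def _validate(self, values):
        for key, value in values.items():
            if key not in self.attributes:
                raise KeyError(f"{self.model} has no attribute {key!r}")
            if not isinstance(value, self.attributes[key]):
                raise TypeError(f"{key!r} must be {self.attributes[key].__name__}")

    def create(self, **values):
        self._validate(values)
        record_id = self.next_id
        self.records[record_id] = dict(values)
        self.next_id += 1
        return record_id

    def read(self, **filters):
        return [
            {"id": rid, **rec}
            for rid, rec in sorted(self.records.items())
            if all(rec.get(k) == v for k, v in filters.items())
        ]

    def update(self, record_id, **values):
        self._validate(values)
        self.records[record_id].update(values)

    def delete(self, record_id):
        del self.records[record_id]


store = GeneratedStore(json.loads(MODEL_JSON))
first = store.create(name="reef-01", depth_m=12.5, count=40)
store.create(name="reef-02", depth_m=30.0, count=7)
store.update(first, count=41)
print(store.read(name="reef-01"))
```

The point of the design is that only the JSON definition changes between projects; the create, read, update, and delete functions follow from it.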

Speakers

Asis Hallab

Jülich Research Center



Sunday July 19, 2020 12:35 - 12:40 EDT
BOSC

12:40 EDT

Q & A 🌀
Question and Answer session for the just-finished talks.

Moderators

Delphine Lariviere

Penn State University
Post-doc in the Galaxy Team (Nekrutenko Lab). Works on bacterial genomics, assembly, RNA Seq, TnSeq. Also interested in evolution, metagenomics, epigenetics and visualisation.

Sunday July 19, 2020 12:40 - 12:45 EDT
Galaxy

12:40 EDT

Q&A for session B1b 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (not sure yet which hemisphere).

Moderators

Chris Fields

Director, HPCBio, University of Illinois Urbana-Champaign
I am a reformed molecular microbiologist associatively directing a moderately sized group of very smart people from crazy diverse backgrounds, and we all work on anything and everything sequence-related.

Sunday July 19, 2020 12:40 - 12:45 EDT
BOSC

13:00 EDT

Broad Institute Data Sciences Platform Sponsor Table
The Broad Institute's Data Sciences Platform aims to accelerate science, transform medicine, and improve lives through data technologies. It is a diverse organization of more than 160 people including engineers, computational scientists and designers who work together and with many external collaborators to deliver high-quality open source software and services, such as the Genome Analysis Toolkit (GATK), the Cromwell workflow management system and Terra, the Broad Institute's cloud-based data access and analysis platform.

Please stop by and learn more about the Broad Data Sciences Platform. We are located on the first floor of the Poster / Demo building.

Speakers

Geraldine Van der Auwera

Director of Outreach and Communications, Broad Institute Data Sciences Platform
I direct outreach and communication efforts for the software and services developed by the Data Sciences Platform at the Broad Institute, which include GATK, the Broad's open source toolkit for variant discovery analysis; the Cromwell/WDL workflow management system; and Terra.bio...



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

eLife Innovation Sponsor Table
eLife works to improve research communication through open science and open technology innovation.

eLife is a non-profit organisation inspired by research funders and led by scientists. Our mission is to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science.

eLife sponsored childcare at the 2018 joint conference, and again at the 2019 Galaxy Conference. This year eLife is sponsoring closed captioning for conference talks.

Please stop by and learn more about eLife. We are located on the first floor of the Poster / Demo building.

Speakers

Emmy Tsang

Innovation Community Manager, Delft University of Technology



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

GigaScience Sponsor Table
GigaScience is an online open-access, open-data, open peer-review journal published by Oxford University Press and BGI. The journal publishes ‘big data’ research from the life and biomedical sciences and, on top of 'omics research, includes the growing range of work that uses difficult-to-access large-scale data, such as imaging, neuroscience, ecology, systems biology, and other new types of shareable data. GigaScience is unique in the publishing industry in that it publishes all research objects (data, software tools, source code, workflows, containers, and other elements related to the work underpinning the findings in the article). To promote Open Science, all published software must be under an OSI-approved license, all supporting data must be available and open, and all peer review is carried out transparently. We present workflows via our GigaGalaxy.net server, and novel work presented at the meeting that uses Galaxy is eligible for a 15% APC discount if it is submitted to our Galaxy series.

Please stop by and learn more about GigaScience. We are located on the first floor of the Poster / Demo building.

Speakers

Laurie Goodman

Publishing Director, GigaScience Press
Laurie Goodman, PhD, is the Publishing Director for GigaScience Press, which publishes the international, open-science journals GigaScience and GigaByte. Both journals have won awards for Innovation in publishing. Dr. Goodman received her BS and MS from Stanford University in 1986...

Ken Cho

Systems Programmer Analyst, GigaScience

Scott Edmunds

Editor in Chief, GigaScience Press/BGI Hong Kong
Scott Edmunds is the Editor in Chief of GigaScience Press. With over 15 years experience in Open Access and Open Data publishing he is co-founder of CivicSight (formerly Open Data Hong Kong) and CitizenScience.Asia, and is on the Board of Directors of the Dryad Digital Repository...




Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-03: A somatic variant-calling pipeline for the support of molecular tumor boards at German university hospitals 🍐
This poster will be presented live at BCC West.

Speakers

Wolfgang Maier

University of Freiburg
Interests: Galaxy tool development; variant calling tools and pipelines; user trainings



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-07: Automating the annotation of biological data through semantic technologies and machine learning 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Lorcán Pigott-Dix

PhD Student, Earlham Institute
I am exploring how to improve the automatic annotation of biological data through machine learning and semantic technologies. My background is in computational ecology, and I am interested in biology, natural language processing, and cultural evolution.


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-08: BioThings Explorer: A platform for distributed knowledge integration across biomedical APIs 🍐
➞ Abstract

This poster will be presented live at BCC West, Poster Room P01-08. 

Speakers

Jiwen Xin

Scripps Research
I'm a senior staff scientist in Scripps Research. I'm a Ph.D. in Biology and a self-taught computer engineer. I love combining my expertise in both Biology and Computer Science to build scalable and high performance open source applications to facilitate biomedical research.



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-10: BLAST in a container 🍐
➞ Abstract

This poster will be presented live at BCC West.

Tom Madden, Christiam Camacho, Yuri Merezhuk, Yan Raytselis
National Center for Biotechnology Information, National Library of  Medicine, National Institutes of Health.  Email: madden@ncbi.nlm.nih.gov
Project Website: https://github.com/ncbi/blast_plus_docs
Source Code: https://github.com/ncbi/docker/tree/master/blast
License: Public Domain
 
The Basic Local Alignment Search Tool (BLAST) is a very popular application for searching and aligning DNA and protein sequences. BLAST is widely used in many different environments and pipelines. To support these use cases better, we are now making a containerized version of BLAST available, using Docker. This approach offers some advantages, including a reproducible run-time environment and the ability to work with bioinformatics workflow languages such as CWL. Additionally, we are staging BLAST databases on some cloud providers, facilitating the use of these resources on the cloud. We discuss the advantages of a containerized version of BLAST and show examples using the containers we provide.
Additionally, we discuss work in progress on a Kubernetes-based system to start our containerized version of BLAST on multiple machines in order to handle large search sets.
This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
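
As an illustration of what running a search through the container can look like, the sketch below only assembles a `docker run` command and prints it. The image name `ncbi/blast` and the `/blast/blastdb`, `/blast/queries`, and `/blast/results` mount points follow the project's public documentation as best recalled and should be checked against it; the host directory, query file, and helper function are assumptions.

```python
import shlex
from pathlib import PurePosixPath


def blast_docker_command(workdir, program, query, db, out, image="ncbi/blast"):
    """Build the argument list for one containerized BLAST search."""
    workdir = PurePosixPath(workdir)
    return [
        "docker", "run", "--rm",
        # Databases and queries are mounted read-only; results are writable.
        "-v", f"{workdir / 'blastdb'}:/blast/blastdb:ro",
        "-v", f"{workdir / 'queries'}:/blast/queries:ro",
        "-v", f"{workdir / 'results'}:/blast/results:rw",
        image,
        program,
        "-query", f"/blast/queries/{query}",
        "-db", db,
        "-out", f"/blast/results/{out}",
    ]


# Hypothetical host directory and file names.
cmd = blast_docker_command("/data/blast", "blastp", "query.fsa", "pdbaa", "blastp.out")
print(shlex.join(cmd))
# To execute (requires Docker and a populated blastdb directory):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a string avoids shell-quoting problems when file names contain spaces.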

Speakers

Tom Madden

NIH
Team Lead for BLAST at the NCBI/NLM/NIH


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P2-10: Deploying Galaxy workflows in containers 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Bert Droesbeke

Data Scientist, VIB



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-01: Earth System Modelling and data analysis with Galaxy Climate Science Workbench 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Anne Fouilloux

Research Software Engineer, University of Oslo
I am working on Galaxy Climate (development of tools, integration of climate data, training material).



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-04: Enabling computational workflows with Tripal and Galaxy 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Sean Buehler

University of Connecticut



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-09: Exploring chromatogram library-based data independent acquisition analysis using EncyclopeDIA within Galaxy framework 🌀
➞ Abstract, Poster

This poster will be presented live at BCC West.

Speakers

Pratik Jagtap

Research Assistant Professor, University of Minnesota
Metaproteomics . DIA . Proteogenomics



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-10: Extensions of READemption for the analysis of several RNA-seq based protocols 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-12: Full-factorial examination of high-throughput microbiome sequencing workflows from sample preparation to bioinformatic analysis 🍐
➞ Abstract
The development of sequencing technologies to evaluate bacterial microbiota composition has allowed insight into the role of the microbiome in human health. However, the variety of methodologies used to prepare and analyze samples for microbiota composition can introduce artifacts, including errors and biases. These artifacts alter our perception of bacterial diversity and our final interpretation of microbiota differences among samples. Using a mock bacterial microbiota of known composition and abundance, we performed a translational bioinformatic pipeline evaluation of various PCR conditions, amplicon library preparation methods, and bioinformatic analyses to gain insight into methodological sources of artifacts.
Genomic DNA was extracted from pure cultures of individual mock bacterial isolates (n = 43), quantified, and then pooled. To compare the effects of PCR on the development of artifacts, we performed all possible permutations of three polymerases, three alternative primer pairs targeting varying regions of the 16S rRNA gene, two barcoding approaches, five elongation times, two annealing temperature offsets, and two amplicon cleanup methods. All individual PCR reactions were sequenced on an Illumina MiSeq platform. Bioinformatic analysis was performed with three different microbiome analysis pipelines, including DADA2, mothur, and QIIME2. Resulting sequence variants were classified as expected or unexpected, and missing members of the mock community were identified. Unexpected reads were further identified as an artifact representative of either chimeras using DECIPHER, mock community sequences containing mismatches or indels, primer dimers, 16S rRNA contamination, or non-16S rRNA off-target amplification.
We found that primer choice accounted for a significant amount of discord between the mock community and sequence output. Additionally, longer amplicon fragment lengths negatively impacted the quality of sequencing reads. Polymerase choice, annealing temperature, and elongation time had negligible impacts on sequencing results. QIIME2 and DADA2 performed similarly using standard pipelines and produced the most accurate results. The use of mothur was associated with a high number of operational taxonomic units that were classified as contamination, which inflated the apparent community diversity.
By employing a defined mock community, this full factorial experiment allowed us to gain insight into methodological sources of pipeline artifacts and to identify a methodology that results in an optimized workflow for improved examination of microbiota composition. This workflow enables full-circle analysis of samples with superior precision in comparison to current workflow standards.
This poster will be presented live at BCC West.
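
The full-factorial design described in the abstract above can be enumerated mechanically. In the sketch below, the number of levels per factor follows the abstract, but the level labels are placeholders, and the totals assume every factor is fully crossed with every other, which the abstract implies but does not state as a count.

```python
from itertools import product

# Factor levels: counts follow the abstract, labels are placeholders.
FACTORS = {
    "polymerase": ["pol_A", "pol_B", "pol_C"],
    "primer_pair": ["primers_1", "primers_2", "primers_3"],
    "barcoding": ["approach_1", "approach_2"],
    "elongation_time": ["t1", "t2", "t3", "t4", "t5"],
    "annealing_offset": ["offset_1", "offset_2"],
    "cleanup": ["method_1", "method_2"],
}
PIPELINES = ["DADA2", "mothur", "QIIME2"]

# One dictionary per PCR condition, covering every combination of levels.
pcr_conditions = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]
# Each condition is then analyzed with each bioinformatic pipeline.
analyses = [(cond, pipe) for cond in pcr_conditions for pipe in PIPELINES]

print(len(pcr_conditions))  # 3 * 3 * 2 * 5 * 2 * 2 = 360
print(len(analyses))        # 360 * 3 = 1080
```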

Speakers

Travis J. De Wolfe

Postdoctoral Scholar, University of Pittsburgh
I am a Postdoctoral Scholar in the Department of Biomedical Informatics at the University of Pittsburgh School of Medicine. The goal of my research is to use microbiological culture techniques and sequencing technologies to test theories regarding the role of bacterial communities...


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P4-07: GVL Demo: from Administrators to End-users 🌀
➞ Abstract

This poster will be presented live at BCC East and BCC West.

Speakers

Nuwan Goonasekera

University of Melbourne



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-04: Magic-BLAST an accurate RNA-seq mapper 🍐
➞ Abstract

This poster will be presented live at BCC West.

Grzegorz Boratyn, Jean Thierry-Mieg, Danielle Thierry-Mieg, Tom Madden
National Center for Biotechnology Information, National Library of  Medicine, National Institutes of Health.  Email: boratyng@ncbi.nlm.nih.gov

Project Website: https://ncbi.github.io/magicblast
Source Code: https://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/1.5.0/ncbi-magicblast-1.5.0-src.tar.gz
License: Public domain

Next-generation sequencing (NGS) technologies facilitate rapid analysis of gene expression across individuals, tissues, or conditions. Mapping reads against a reference genome is the first step in many genomics analysis pipelines. It is therefore essential to map the reads reliably. Many algorithms were developed to tackle this problem; however, few of them can map long reads well.

We present Magic-BLAST, a tool for mapping NGS runs against one or multiple genomes or transcriptomes. It incorporates ideas from the MAGIC-AceView pipeline implemented within the BLAST code base. Magic-BLAST processes NGS reads in batches. It builds an index of a batch of reads and scans a BLAST database (a genome or transcriptome) for potential word matches. Each match becomes a seed for local alignment computation. To avoid aligning to repeats, Magic-BLAST first counts word occurrences in the genome and removes frequent words from the read index. Finally, collinear local alignments are combined into spliced alignments.
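
The seeding idea in the paragraph above (index the words of a read batch, drop words that are frequent in the genome, then scan the genome for matches) can be sketched on a toy scale. This is not Magic-BLAST's code: the word length, the frequency cutoff, the function names, and the sequences are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict


def words(seq, k):
    """Yield (offset, word) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def find_seeds(reads, genome, k=4, max_genome_count=1):
    """Index read words, drop words frequent in the genome, then scan the genome."""
    genome_counts = Counter(word for _, word in words(genome, k))
    index = defaultdict(list)
    for read_id, read in reads.items():
        for offset, word in words(read, k):
            if genome_counts[word] <= max_genome_count:  # skip repeat-like words
                index[word].append((read_id, offset))
    seeds = []
    for genome_pos, word in words(genome, k):
        for read_id, read_offset in index.get(word, []):
            seeds.append((read_id, read_offset, genome_pos))
    return seeds


# Toy sequences: the genome starts with a short repeat, "r2" falls inside it.
genome = "ACGTACGTACGTTTGACCAGT"
reads = {"r1": "TTGACC", "r2": "ACGTAC"}
for seed in find_seeds(reads, genome):
    print(seed)
```

With the cutoff at 1, only the read from the unique region produces seeds, and its seeds share one diagonal (genome position minus read offset), which is what makes them candidates for extension into a single local alignment.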

Magic-BLAST is very robust across a wide range of conditions. It works well with reads generated by Illumina, Roche 454, and PacBio platforms. It also provides very good performance when mapping against genomes with biased compositions or from related species. Magic-BLAST is very accurate in intron discovery and outperforms similar programs.

Magic-BLAST is convenient to use. It does not need any special tuning for different technologies and genomes. It works well in different conditions using default parameters. It directly accesses reads stored in the NCBI Sequence Read Archive (SRA), without the need to download the data beforehand. It works with FASTA and FASTQ files. It can align reads to sequences in BLAST databases or FASTA files and integrates well with NCBI facilities and services.

Magic-BLAST is available as Linux, Mac, and Windows executables and as a Docker image, and can be installed from Bioconda. Recently added features include better handling of nanopore reads and reporting results while skipping over regions with too many sequencing errors for reliable alignment.

This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH.

Speakers

Grzegorz Boratyn

BLAST developer at NCBI/NLM/NIH


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-07: Open-Source, Large-Scale Set Similarity Search with Sketch 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Daniel Baker

PhD Candidate, Johns Hopkins University


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-10: Planet Microbe: Toward the reintegration of oceanographic ‘omics dataset in their environmental and physiochemical context 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Alise Jany Ponsero

postdoc, University of Arizona
I'm a postdoc at the University of Arizona, working on computational tools and cyberinfrastructures for metagenomics


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-12: Progerin expression induces a significant downregulation of transcription from human repetitive sequences in iPSC-derived dopaminergic neurons 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Walter Arancio

Precarious researcher in Italy...
My main research line concerns the molecular aspects that underlie the processes of human development and aging, and their effects on oncogenic transformation. In detail, my studies regard the mutual influences between [1] repeated sequences (LINE-1, ALU, et cetera), [2] ncRNAs...



Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-01: pyGenomeTracks: Reproducible plots for multivariate genomic data sets 🍐
➞ Abstract

This poster will be presented live at BCC East and BCC West.

Speakers

Lucille Delisle

Post-doc, EPFL SV ISREC UPDUB
Hi, I am a Post-doc in the Denis Duboule lab working on gene regulation during development. For the scientific part, I analyzed various NGS methods including Hi-C, ATAC-seq, and CUT&RUN. I recently developed a new method for single-cell RNA-seq, named baredSC. For the Galaxy part, I develop...


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-02: Reproducible, collaborative and exploratory data analysis using CyVerse VICE 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Reetu Tuteja

Science Analyst, CyVerse, University of Arizona


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-03: Running and sharing the ENCODE atac-seq pipeline on Truwl 🍐
➞ Abstract

Please see a video of the demo here: https://youtu.be/J_hlAuopobY

This poster will be presented live at BCC West.

The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) has produced a set of high-quality analysis pipelines that are used by the ENCODE Consortium and have been released to the community. The pipelines are described with the Workflow Description Language (WDL) and use containerization to enhance reproducibility. To increase the usability and dissemination of these pipelines further we have developed a web interface on Truwl (https://truwl.com/) for specifying parameters and inputs for the ENCODE atac-seq pipeline. The pipeline can be executed directly from the web interface on Google Cloud Platform (GCP). Once compute jobs are successfully executed, the analysis is posted back to Truwl to allow others to view the parameters, inputs, and outputs of previously executed pipelines. Automatically posting previously executed jobs provides increased transparency of computational experiments and provides examples for others to follow. All content on Truwl is open-access, web-searchable, and has unique identifiers making it easy to find and easy to share. In this software demonstration we will show the use of the atac-seq pipeline from Truwl by both specifying the parameters and inputs from the web interface individually and reusing a previously posted analysis.


Speakers
Karl Sebby

President, Truwl


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-04: : SigBio-Shiny: A standalone interactive application for detecting biological significance on a set of genes 🍐
➞ Abstract

This poster will be presented live at BCC East and BCC West.

Speakers
Sangram Keshari Sahu

Genomics Data Scientist


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-08: : Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
Sam Kovaka

Johns Hopkins University


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-12: : Towards more FAIR research software 🍐
➞ Abstract

This poster will be presented live at BCC East and BCC West.

Speakers
Mateusz Kuzak

Community Officer, The Netherlands eScience Center


Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

Poster / Demo West Session 1
The first poster and demo session of BCC2020.

Access the Poster / Demo hall through the "Go to Posters" button at the top left in the main BCC2020 Remo conference space.

Sunday July 19, 2020 13:00 - 13:45 EDT
Joint

14:00 EDT

BOSC West Session 2: Reproducibility and standards 🍐
The second accepted talk session of BCC2020 is split into multiple tracks. This track includes talks submitted to the BOSC track.

Moderators
Moni Muñoz-Torres

Oregon State University

Sunday July 19, 2020 14:00 - 15:00 EDT
BOSC
  Meeting-West

14:00 EDT

Galaxy West Session 2: Extending the Galaxy ecosystem 🌀
Presentations about extending the Galaxy ecosystem.

All speakers in this session will be available for live Q&A at the end of this session.

Sunday July 19, 2020 14:00 - 15:00 EDT
Galaxy
  Meeting-West

14:01 EDT

Automated generation of training materials from markdown documents 🌀
➞ Abstract 

Delphine Larivière 1,4, Frederick Tan 2, John Muschelli 2, James Taylor 3,4, Jeff Leek 2 and the Galaxy Project 4

  1. Nekrutenko Lab, BMB department, Eberly College of Science, The Pennsylvania State University
  2. Leek group, Data Science Lab, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
  3. Taylor Lab, Biology Department, Johns Hopkins University
  4. Galaxy Project https://galaxyproject.org/

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
Delphine Lariviere

Penn State University
Post-doc in the Galaxy Team (Nekrutenko Lab). Works on bacterial genomics, assembly, RNA Seq, TnSeq. Also interested in evolution, metagenomics, epigenetics and visualisation.



Sunday July 19, 2020 14:01 - 14:15 EDT
Galaxy

14:01 EDT

Bionitio: building better bioinformatics tools with batteries included 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC East).

Authors: Peter Georgeson, Anna Syme, Jessica Chung, Michael Milton, Harriet Dashnow, Andrew Lonsdale, Clare Sloggett, Bernard Pope
License: MIT
URL: https://github.com/bionitio-team/bionitio
Publication: Georgeson, Syme et al. Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. Gigascience 8, (2019).

The results-driven focus of bioinformatics means that shortcuts are often taken during software development for the sake of making something "that works". Furthermore, many bioinformaticians are not trained in software engineering, and research-oriented projects have limited budgets for quality assurance.

In response to this problem we have developed Bionitio, a tool that automates the process of starting new bioinformatics software projects following recommended best practices. With a single command, the user can create a new well-structured project in one of twelve programming languages. The resulting software is functional, carrying out a prototypical bioinformatics task, and thus serves as both a working example and a template for building new tools. Key features include command-line argument parsing, error handling, logging, defined exit status values, a test suite, a version number, standardised building and packaging, documentation, a standard open-source software license, revision control, and containerisation.

For example, the following command creates a new Python 3 project called skynet using the BSD 3 Clause license and creates a remote repository on GitHub for username cyberdyne:

bionitio-boot.sh -i python -n skynet -c BSD-3-Clause -g cyberdyne
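
Several of the key features listed above (argument parsing, error handling, logging, defined exit status values, a version number) can be sketched together in a few lines of Python. This is only an illustration, not the code Bionitio generates: the program name fastacount, the exit status values, and the FASTA-counting task are assumptions made for this example.

```python
import argparse
import logging
import sys

# Hypothetical exit status values; the values Bionitio defines may differ.
EXIT_OK = 0
EXIT_FILE_ERROR = 1

PROGRAM_VERSION = "0.1.0"  # hypothetical version number


def parse_args(argv):
    """Parse command-line arguments for the example tool."""
    parser = argparse.ArgumentParser(
        prog="fastacount", description="Count sequences in FASTA files")
    parser.add_argument("--version", action="version", version=PROGRAM_VERSION)
    parser.add_argument("--log", metavar="LOG_FILE", help="write a log file")
    parser.add_argument("fasta_files", nargs="+", metavar="FASTA_FILE")
    return parser.parse_args(argv)


def count_sequences(lines):
    """Count FASTA records: one per header line starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def main(argv=None):
    """Run the tool and return an exit status instead of raising."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    if args.log:
        logging.basicConfig(filename=args.log, level=logging.INFO)
    for path in args.fasta_files:
        try:
            with open(path) as handle:
                print(path, count_sequences(handle), sep="\t")
        except OSError as error:
            # Report the problem and stop with a defined, non-zero status.
            logging.error("cannot read %s: %s", path, error)
            print(f"fastacount: cannot read {path}", file=sys.stderr)
            return EXIT_FILE_ERROR
    return EXIT_OK
```

A script would typically finish with sys.exit(main()) so that the shell sees the returned status.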

Bionitio serves as a learning aid for beginner-to-intermediate bioinformatics programmers and provides an excellent starting point for new projects. This helps developers adopt good programming practices from the beginning of a project and encourages high-quality tools to be developed more rapidly. Bionitio has been used in several workshops, providing a common codebase for coordination of workshop materials and an extensible platform for the delivery of hands-on practical activities. Additionally, by providing complete working examples in many different languages, Bionitio acts as a kind of "Rosetta Stone" and is therefore an excellent vehicle for comparative programming skills transfer.

In this talk we will describe the design and implementation of Bionitio and demonstrate how it can be used to quickly start new open source bioinformatics projects.

Speakers
Bernie Pope

Victorian Health and Medical Research Fellow, Melbourne Bioinformatics, University of Melbourne
I am an Associate Professor at The University of Melbourne. My research focuses on applying computational techniques to biological questions, especially related to Human Genomics and Cancer.



Sunday July 19, 2020 14:01 - 14:15 EDT
BOSC
  Meeting-West

14:15 EDT

Enhancing rigor and reproducibility in biomedical research 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Jaqueline J. Brito 1,*, Jun Li 2, Jason H. Moore 3, Casey S. Greene 4,5, Nicole A. Nogoy 6, Lana X. Garmire 2, Serghei Mangul 1,7

1 Dept. of Clinical Pharmacy, School of Pharmacy, University of Southern California, USA
2 Dept. of Computational Medicine & Bioinformatics, University of Michigan, USA
3 Dept. of Biostatistics, Epidemiology, and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, USA
4 Dept. of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, USA
5 Childhood Cancer Data Lab, Alex's Lemonade Stand, USA
6 GigaScience, Hong Kong
7 Quantitative and Computational Biology, University of Southern California, USA
*Email: britoj@usc.edu

Project Website: https://github.com/Mangul-Lab-USC/enhancing_reproducibility
License: CC BY 4.0 License

Computational methods have reshaped the landscape of modern biology, generating new channels of
communications to publish and share the most recent techniques and methodologies. While the
dependence on computational tools of the biomedical community increases steadily, the
mechanisms ensuring open data, open software, and reproducibility are heterogeneously
enforced. Institutions, funders, and publishers offer different guidelines, or no guideline at all.
For instance, publications may cite software artifacts, key to reproduce research results, that
may become unavailable or depend on packages no longer supported. Publications lacking fully
reproducible research significantly limit the role of reviewers in evaluating technical strength
and scientific contribution. Moreover, incomplete ancillary information for an academic
software package will likely bias and restrict any subsequent research produced with the tool.
In this presentation, we provide eight recommendations across four different domains to
improve three main principles: reproducibility, transparency, and rigor in computational
biology. These are the main principles which should be emphasized in life sciences curricula,
especially as assays and pipelines grow more complex than ever. We propose that a
combination of lowering the learning curve needed to maintain the three principles and
stricter guidelines is key to ensuring adoption by the community. Ultimately, our
recommendations target fostering a sustainable data science ecosystem in biomedicine and life
science research.
Keywords: Reproducibility; Open science; Reproducible research; FAIR principles.

Speakers
Jaqueline J. Brito

Dept. of Clinical Pharmacy, School of Pharmacy, University of Southern California



Sunday July 19, 2020 14:15 - 14:20 EDT
BOSC

14:15 EDT

Integrating refgenie and Galaxy for reference data management: a proposal for IDC 🌀
➞ Abstract

Ignacio Eguinoa 1,2, Frederik Coppens 1,2

  1. Ghent University, Department of Plant Biotechnology and Bioinformatics, 9052 Ghent, Belgium
  2. VIB Center for Plant Systems Biology, 9052 Ghent, Belgium

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
Ignacio Eguinoa

ELIXIR Belgium - VIB Center for Plant Systems Biology



Sunday July 19, 2020 14:15 - 14:20 EDT
Galaxy

14:20 EDT

Galaxy and its Tool Shed on Python 3: conclusion of a long journey 🌀
➞ Abstract

Nicola Soranzo 1, Marius van den Beek 2

  1. Earlham Institute, Norwich Research Park, Norwich, UK. Email: nicola.soranzo@earlham.ac.uk
  2. Penn State University, University Park PA, USA.

The presenter(s) will be available for live Q&A at the end of this session in both BCC West and BCC East.

Speakers
Nicola Soranzo

Earlham Institute



Sunday July 19, 2020 14:20 - 14:25 EDT
Galaxy

14:20 EDT

Secondary analysis of publicly available omics data across almost 3 million publications 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Nicholas Darci-Maher 1, Kerui Peng 3, Dat Duong 1, Richard J. Abdill 2, Eleazar Eskin 1, Serghei Mangul 3

1 University of California, Los Angeles, California, USA. Email: niko.darcimaher@gmail.com
2 University of Minnesota, Minnesota, USA
3 University of Southern California, California, USA

Methods code: https://github.com/smangul1/data_reusability
License: MIT License

Abstract
As today's high-throughput sequencing techniques become increasingly affordable and accurate,
the number of publicly available omics datasets is growing rapidly. Bioinformatics methods provide
unprecedented opportunities for analysis of omics datasets in quantitative biological research.
Traditionally, such research has included primary analysis of novel omics data developed as part of the
study. However, this data has the potential to be reused, and is often valuable beyond the scope of the
study that introduced it. Data-driven research by secondary analysis on existing datasets is becoming
more important. Increased availability of public omics data represents an opportunity to find novel
insights and discoveries across different datasets.
This study presents a quantitative analysis of the reusability of omics datasets in two online
repositories, the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO). We
downloaded over 2.5 million publications from the PubMed Central Open Access corpus, and identified
those that referenced SRA or GEO datasets. We used these papers to examine reusability based on various
factors, including journal, repository, sequencing technology, and species. We find that most datasets are
never reused: these datasets are mentioned once in the study that introduced them, but then never
referenced again. In recent years, however, data reuse has been rising. We aim to shed light on the landscape of
data sharing in the quantitative biology research community, and illuminate the benefits of secondary
analysis of omics data.
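
The idea of identifying papers that reference SRA or GEO datasets, and then asking which datasets are mentioned by more than one paper, can be sketched with a regular expression. This is an illustrative assumption and not the study's actual method (see the methods code linked above): the pattern covers only GEO series accessions (GSE) and SRA-style study, experiment, and run accessions, and the paper IDs and texts are made up.

```python
import re
from collections import defaultdict

# Assumed accession shapes: GEO series (GSE123) and SRA/ENA/DDBJ
# study, experiment, and run accessions (SRP..., ERX..., DRR...).
ACCESSION = re.compile(r"\b(GSE\d+|[SED]R[PXR]\d{6,})\b")


def datasets_by_paper(papers):
    """Map each accession to the set of paper IDs that mention it."""
    mentions = defaultdict(set)
    for paper_id, text in papers.items():
        for accession in ACCESSION.findall(text):
            mentions[accession].add(paper_id)
    return mentions


def reused(mentions):
    """Return accessions mentioned by more than one paper, sorted."""
    return sorted(acc for acc, ids in mentions.items() if len(ids) > 1)


# Made-up example corpus, for illustration only.
papers = {
    "paperA": "Reads were deposited in SRA under SRP012345.",
    "paperB": "We reanalysed SRP012345 together with GSE4242.",
    "paperC": "Expression data are available as GSE99999.",
}
mentions = datasets_by_paper(papers)
print(reused(mentions))  # prints ['SRP012345']
```

A real analysis would also need to separate the paper that introduced a dataset from later papers that reuse it, which this sketch does not attempt.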

Speakers
Nicholas Darci-Maher

University of California, Los Angeles



Sunday July 19, 2020 14:20 - 14:25 EDT
BOSC

14:25 EDT

Q & A 🌀
Question and Answer session for the just finished talks.

Sunday July 19, 2020 14:25 - 14:30 EDT
Galaxy

14:25 EDT

Q&A 🍐
The presenter(s) will be available for live Q&A in this session.

Sunday July 19, 2020 14:25 - 14:30 EDT
BOSC

14:30 EDT

CrowdGO: Gene Ontology prediction using a meta approach 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Maarten JMF Reijnders 1,2 and Robert M. Waterhouse 1,2

1 University of Lausanne, Lausanne, Switzerland.
2 Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Email: maarten.reijnders@unil.ch

Source code: https://gitlab.com/mreijnders/CrowdGO
License: GNU General Public License v3.0

Methods to predict protein functions, defined here as assigning Gene Ontology (GO) terms,
vary considerably in their underlying approach, with different methods employing techniques
such as sequence homology, machine learning, or text mining. This often results in dramatically
different sets of GO terms predicted for the same sets of proteins. These methods are reviewed
in the Critical Assessment of Functional Annotation competitions (CAFA) (Zhou 2019), but even
the best scoring methods can be inaccurate, and none truly stand out. To concurrently exploit
the strengths of each method, we developed a meta-predictor that evaluates the predictions of
multiple top-performing methods.
CrowdGO compares the predictions of different methods and uses a machine learning model to
improve the precision, recall, and F-max scores of the resulting meta-predictions. The model can
be trained based on user-selected prediction methods, or a pre-trained model can be used. The
pre-trained models are built using prediction tools that are exclusively open-source, easy to use,
and computationally non-demanding. CrowdGO includes Snakemake workflows to use existing
models for GO term prediction, or to train new models.
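
For readers unfamiliar with the F-max score mentioned above, the following sketch computes it as the best F1 value over a range of confidence cut-offs. It is a simplification under stated assumptions: all (protein, GO term) pairs are pooled, whereas CAFA averages precision and recall per protein, and the function name, scores, and GO identifiers are hypothetical rather than taken from CrowdGO.

```python
def f_max(scored, truth, thresholds=None):
    """Return (best F1, threshold) over confidence cut-offs.

    scored: dict mapping (protein, go_term) -> confidence in [0, 1]
    truth:  non-empty set of true (protein, go_term) pairs
    """
    if not truth:
        raise ValueError("truth must contain at least one annotation")
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = (0.0, None)
    for t in thresholds:
        # Keep only predictions at or above the current cut-off.
        predicted = {pair for pair, score in scored.items() if score >= t}
        if not predicted:
            continue
        true_pos = len(predicted & truth)
        precision = true_pos / len(predicted)
        recall = true_pos / len(truth)
        if precision + recall == 0:
            continue
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, t)
    return best


# Made-up predictions and annotations, for illustration only.
scored = {
    ("P1", "GO:0001"): 0.9,
    ("P1", "GO:0002"): 0.4,
    ("P2", "GO:0003"): 0.7,
    ("P2", "GO:0004"): 0.2,
}
truth = {("P1", "GO:0001"), ("P2", "GO:0003"), ("P2", "GO:0004")}
best_f1, best_threshold = f_max(scored, truth)
print(round(best_f1, 3), best_threshold)
```

A meta-predictor is judged to help if its F-max on held-out annotations exceeds that of each input method evaluated the same way.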
Using a model built with four input predictions from a sequence homology-based predictor, Wei2GO (Reijnders 2020), two protein domain-based predictors, InterProScan (Mitchell 2019) and FunFams (Scheibenreif 2019), and a deep learning predictor, DeepGOPlus (Kulmanov 2019), CrowdGO increases both the precision and meaningful recall compared to each input method (Figure 1).
CrowdGO is fully open source and leverages other open source tools. It is straightforward to use, thanks both to the simplicity of the software and to the accompanying Snakemake pipelines. Due to the nature of its meta-prediction algorithm, it will stay relevant even when improved function prediction software becomes available.


Speakers
Maarten Reijnders

Department of Ecology and Evolution, University of Lausanne



Sunday July 19, 2020 14:30 - 14:35 EDT
BOSC

14:30 EDT

Implementation of the IEEE-2791-2020 standard (BioCompute Objects) in Galaxy via workflow invocations 🌀
➞ Abstract

Charles Hadley King 1, Nicola Soranzo 2

  1. George Washington University, Washington D.C. USA
  2. Earlham Institute, Norwich Research Park, Norwich, UK

The presenter(s) will be available for live Q&A at the end of this session (BCC West)

Speakers
Charles Hadley King

Senior Research Associate, George Washington University



Sunday July 19, 2020 14:30 - 14:35 EDT
Galaxy

14:35 EDT

Goslin - A grammar of succinct lipid nomenclature 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Nils Hoffmann 1, Dominik Kopczynski 1, Bing Peng 2, Robert Ahrends 3

1 Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V., Otto-Hahn-Straße 6b, 44227 Dortmund, Germany. Email: nils.hoffmann@isas.de
2 Karolinska Institutet, Solna, Stockholm, Sweden.
3 Department of Analytical Chemistry, University of Vienna, Vienna, Austria.

Project Website: https://lifs.isas.de/goslin & https://apps.lifs.isas.de/goslin
Source Code: https://github.com/lifs-tools/goslin (main hub to implementations)
License: Apache v2 LICENSE & MIT License


Main Text of Abstract

We introduce the 'Grammar of Succinct Lipid Nomenclature' (Goslin), a polyglot grammar for
common lipid shorthand nomenclatures based on the LipidMaps nomenclature and the shorthand
nomenclature established by Liebisch et al. and used by LipidHome and SwissLipids, accompanied
by parser implementations in C++, Java, Python and R.

Lipid naming has evolved into several dialects, which complicates the unified computational
treatment and parsing of lipid names. As a consequence, long and error-prone manual curation
often is necessary in order to streamline lists of lipid names for their processing in follow-up
analysis scripts, workflows, or tools, or for their submission to research data repositories. Goslin
was designed to address the following pressing issues in the lipidomics field especially: 1) to
simplify the implementation of lipid name handling for developers of mass spectrometry-based
lipidomics tools; 2) to offer a tool that unifies and normalizes the main existing lipid name dialects
enabling a lipidomics analysis in a high-throughput fashion.

Goslin and its parser implementations are thus designed to act as a library for the development of
lipidomics tools providing a standardized data structure for storing structural lipid information.
The parsing of lipid names as well as the lipid name generation are the main functions of Goslin. We
therefore defined a context-free grammar (with ANTLR4) that defines rules and productions for all
structural properties of the lipid nomenclature, including mass spectrometry specific information
about unlabeled and heavy isotope labeled species, as well as fragments and adducts. We recently
added the calculation of masses and sum formulas, when the head group's sum composition is
known. Currently, the grammar covers 289 lipid classes within the seven most occurring lipid
categories in eukaryotic organisms, namely fatty acyls, glycerolipids, glycerophospholipids,
saccharolipids, sphingolipids, sterol lipids, and polyketides. The major advantages of using a
grammar rather than a manually coded parser are its flexibility and extensibility. Regular
expressions are also not suitable for parsing lipid names, since they are incapable of recognizing
nested patterns and can only recognize words from regular languages.
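
To make the two main functions (parsing a lipid name into a data structure, and generating names from it) concrete, here is a toy sketch. It is not Goslin's grammar, parser, or API: it is a hand-written recursive-descent parser for an assumed three-rule grammar that covers only names such as "PC 16:0/18:1", and every class and function name in it is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class FattyAcylChain:
    carbons: int
    double_bonds: int


@dataclass
class Lipid:
    head_group: str
    chains: List[FattyAcylChain]
    positions_known: bool  # "/" separates known positions, "_" unknown ones

    def name(self) -> str:
        """Regenerate the chain-level name from the stored structure."""
        sep = "/" if self.positions_known else "_"
        chains = sep.join(f"{c.carbons}:{c.double_bonds}" for c in self.chains)
        return f"{self.head_group} {chains}"

    def species_name(self) -> str:
        """Generate the sum-composition name, e.g. 'PC 34:1'."""
        carbons = sum(c.carbons for c in self.chains)
        double_bonds = sum(c.double_bonds for c in self.chains)
        return f"{self.head_group} {carbons}:{double_bonds}"


TOKEN = re.compile(r"\s*([A-Za-z]+|\d+|[:/_])")


def parse_lipid(text: str) -> Lipid:
    """Toy grammar: lipid := HEAD chain (SEP chain)* ; chain := INT ':' INT."""
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text.replace(" ", ""):
        raise ValueError(f"unexpected character in {text!r}")
    pos = 0

    def take(pattern: str) -> str:
        nonlocal pos
        if pos >= len(tokens) or not re.fullmatch(pattern, tokens[pos]):
            raise ValueError(f"expected {pattern} at token {pos} in {text!r}")
        pos += 1
        return tokens[pos - 1]

    def chain() -> FattyAcylChain:
        carbons = int(take(r"\d+"))
        take(":")
        return FattyAcylChain(carbons, int(take(r"\d+")))

    head = take(r"[A-Za-z]+")
    chains = [chain()]
    separators = set()
    while pos < len(tokens):
        separators.add(take(r"[/_]"))
        chains.append(chain())
    if len(separators) > 1:
        raise ValueError(f"mixed separators in {text!r}")
    return Lipid(head, chains, separators != {"_"})


lipid = parse_lipid("PC 16:0/18:1")
print(lipid.name(), "->", lipid.species_name())
```

Each grammar rule maps to one small function here, which hints at why a grammar with a parser generator scales better than hand-coding once hundreds of lipid classes and several dialects are involved.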

We provide implementations of Goslin in four major programming languages, namely C++, Java,
Python 3, and R to kick-start adoption and integration. Further, we set up a web service for users to
work with Goslin directly and via an OpenAPI-compliant REST API. All implementations are
available free of charge under a permissive open source license; binary releases are available from
Zenodo. We are currently working on making the libraries available via BioConda/BioContainers
and other community-facing repositories.

Speakers
Nils Hoffmann

Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V.



Sunday July 19, 2020 14:35 - 14:40 EDT
BOSC

14:35 EDT

Porting the rCASC workflow for scRNA-Seq data analysis to Galaxy and the Laniakea Galaxy on-demand system 🌀
➞ Abstract

Pietro Mandreoli 1, Luca Alessandrì 2, Marco Antonio Tangaro 3, Raffaele Calogero 2, Federico Zambelli 4

  1. Dept. of Biosciences, University of Milano - Italy.
  2. Dept. of Molecular Biotechnology and Health Sciences, University of Torino - Italy.
  3. Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, CNR - Italy.
  4. Dept. of Biosciences, University of Milano - Italy.

The presenter(s) will be available for live Q&A at the end of this session (BCC West)

Speakers
Pietro Mandreoli

Dept. of Biosciences, University of Milano



Sunday July 19, 2020 14:35 - 14:40 EDT
Galaxy

14:40 EDT

Executable Research Article (ERA): Enrich a research paper with code and data 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

the eLife team and the Stencila team

(Presenter: Emmy Tsang, Innovation Community Manager, eLife; email: e.tsang@elifesciences.org)

Project Website: https://elifesci.org/reprodoc (this will be updated early June)
Source Code: https://github.com/stencila; https://github.com/elifesciences
License: Apache License 2.0 (for Stencila); MIT (for eLife)

Main Text of Abstract

Code and data are important research outputs and integral to a full understanding of research
findings and experimental approaches in a paper. However, traditional research articles seldom
have these embedded in the manuscript's narrative, but instead, leave them as "supplementary
materials", if they are openly available.

With Executable Research Articles (ERAs), our vision is to enrich the traditional narrative of a
research article with code, data and interactive figures that can be executed in the browser,
downloaded and explored. It will give readers a direct insight into the methods, algorithms and key
data behind the published research.

We published our first demo ERA in February 2019. Over the past year, we have been working
closely with our collaborator Stencila to build an open tool stack that would enable our authors and
production team to easily publish ERAs at scale. In this talk, we hope to showcase the potential of
ERAs with examples and walk through how authors can enrich their traditional eLife paper using
Stencila Hub, through:

- Starting a Stencila Hub project linked to their eLife paper
- Converting the article to a reproducible notebook format of their preference, while preserving the relevant journal article metadata
- Uploading the data required to enable live re-execution of tables and figures in the article
- Replacing static tables and figures with code chunks that reproduce them

We will share our current vision of how ERAs will be integrated into our production workflow and
collect feedback. We also hope to engage participants in exploring potential functionalities for the
tool stack and building a community-driven roadmap.

Speakers
Emmy Tsang

Innovation Community Manager, Delft University of Technology



Sunday July 19, 2020 14:40 - 14:55 EDT
BOSC

14:40 EDT

Galaxy, Selenium, and End-to-end Testing 🌀
➞ Abstract

Oleg Zharkov 1, Dave Bouvier 2, Juan David Mendez 1, Björn Grüning 1, John Chilton 2

  1. Department of Computer Science, Albert-Ludwigs-Universität Freiburg
  2. Department of Biochemistry and Molecular Biology, Penn State University, University Park PA, USA.

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
Oleg Zharkov

Albert-Ludwigs-Universität Freiburg



Sunday July 19, 2020 14:40 - 14:55 EDT
Galaxy

14:55 EDT

Q & A 🌀
Question and Answer session for the just finished talks.

Sunday July 19, 2020 14:55 - 15:00 EDT
Galaxy

14:55 EDT

Q&A 🍐
The presenter(s) will be available for live Q&A in this session.

Sunday July 19, 2020 14:55 - 15:00 EDT
BOSC
 
Monday, July 20
 

10:00 EDT

BCC2020 Conference Day 2: West
Keynotes, accepted talks, posters, demos, and networking in the West.

Monday July 20, 2020 10:00 - 15:00 EDT
Joint

10:01 EDT

Day 2 Welcome
Daily announcements and an icebreaker.

Moderators
Dave Clements

Training and Outreach Coordinator, Galaxy Project, Johns Hopkins University

Monday July 20, 2020 10:01 - 10:15 EDT
Joint

10:15 EDT

West Keynote 2: Open minds bring open collaborations
➔ Slides, Abstract

Prashanth N Suravajhala

  1. Birla Institute of Scientific Research, Statue circle, Jaipur, India
  2. Bioclues.org, India

Post-COVID-19 times have ushered in fierce competition to deliver, be it a vaccine, funding, or a publication. As researchers, we have a fair conception of being guided by reason, not emotion, amid the 'publish or perish' adage. On the other hand, multitasking research and publishing has become a noticeable goal, but combining these tasks over time has become the need of the hour. In today's reserved funding situation, many early/mid-career researchers face a daunting task to establish and develop their research programs, for example starting their own labs through crowdsourcing or obtaining funds from their previous associations/host institutions, and to publish the results. But to what extent are we trying to preserve the fairness or integrity of science? I would like to draw your attention to the 'Hippocratic Oath for Scientists', which would ensure keeping research vitality in the best interests of science to sustain excellence. Towards this, the talk will delve into how the three Cs, viz. Consistency, Continuity and Credibility, augur well for a successful open organization. This would invariably bring successful Collaborations, Convergence, and, importantly, Control over mind to the fore. The growth of an individual or organization depends on fostering commitment to open culture, net neutrality and universal access to information in the education and science fields. So, it is the Collaborative index (C-index) that matters. Are we ready?


This session will be introduced by Dave Clements.

Speakers
Prashanth Suravajhala

Senior Scientist and Founder, Bioclues.org, Birla Institute of Scientific Research; Bioclues
Prashanth N Suravajhala is a senior scientist at Birla Institute of Scientific Research, Jaipur. A PhD in Systems Biology, he went on to gain more than 7 years of postdoctoral experience across four different laboratories. He is interested in exploring the known unknown regions in the human genome, primarily...



Monday July 20, 2020 10:15 - 11:00 EDT
Joint
  Meeting-West

11:15 EDT

BOSC West Session 3: Building Open Source Communities (BOSC) 🍐
Accepted talks and lightning talks.

Moderators
Yo Yehudi

Software Developer, University of Cambridge & Open Life Science
Integrated genomic data (InterMine)

Monday July 20, 2020 11:15 - 12:00 EDT
BOSC
  Meeting-West

11:15 EDT

Galaxy West Session 3: Galaxy Administration 🌀
Accepted talks and lightning talks.

Monday July 20, 2020 11:15 - 12:00 EDT
Galaxy
  Meeting-West

11:16 EDT

Building open source communities and empowering new contributors 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Yo Yehudi 1, Adrián Bazaga 1,2, Daniela Butano 1, Rachel Lyne 1, K.H. Reierskog 1,
InterMine Collaborators, Gos Micklem 1

1 Department of Genetics, University of Cambridge, Cambridge, United Kingdom
2 STORM Therapeutics Ltd, Cambridge, United Kingdom

Project Website: http://intermine.org/
Source Code: https://github.com/intermine/internships
License: CC-BY + Apache - https://github.com/intermine/internships/blob/master/README.md

Background: Open source software is software whose source code is open for redistribution and
modification, and whose license doesn't restrict how the software can be used
(https://opensource.org/osd). Many open source projects take the meaning of open source far
beyond this definition by building structured communities that facilitate contributions to the code
base, documentation, and design of the software. We will share our experiences from building
community interactions into InterMine (an open source biological data warehouse).

Internship programs: Joining open source communities can often be a challenge to
newcomers, who may not be aware of unwritten rules, community norms, and expectations. To
help change this, projects like InterMine participate in structured long-term programs to help
onboard newcomers. Two programs of note in this domain are Google Summer of Code (also
known as GSoC, https://summerofcode.withgoogle.com/) and Outreachy
(https://www.outreachy.org/).

Unpaid initiatives: Hacktoberfest (https://hacktoberfest.digitalocean.com/) is a month-long
drive to incentivise contributions to open source software. With the "first timers only" initiatives
(https://www.firsttimersonly.com/) InterMine curates, describes, and tags easier issues to make
them extra-friendly for beginners, creating a low-barrier on-ramp for its contributors.

Practical benefits: InterMine has been mentoring interns recruited via GSoC and Outreachy on
a yearly basis since 2017 and is doing so again in 2020. Over this time we have had tangible
production-ready practical benefits from the projects our interns have worked on, including a
registry for listing public instances of our software (http://registry.intermine.org/) and upgraded
SOLR search functionality
(https://intermineorg.wordpress.com/2018/11/15/intermine-3-0-solr-search/).

Contributors are offered benefits such as sponsored conference and hackathon attendance,
community-branded "swag", and recommendations for university and job applications.

Year-on-year, we find interns and Hacktoberfest contributors tend to return in later years in
many ways: as mentors, to offer technical support for their work, and even joining as staff.

Summary: Scientific and research software can strongly benefit from embracing open source
community models and initiatives, gaining both completed practical projects and a greater pool
of skilled contributors. Thoughtfully designed pathways enable contributors to engage and stay
involved in the longer term, even when contributors themselves come from non-scientific
backgrounds.

Speakers
Yo Yehudi

Software Developer, University of Cambridge & Open Life Science
Integrated genomic data (InterMine)



Monday July 20, 2020 11:16 - 11:30 EDT
BOSC

11:16 EDT

The cloud-native Galaxy: Galaxy on Kubernetes 🌀
➞ Abstract

Alexandru Mahmoud 1, Nuwan Goonasekera 2, Pablo Moreno 3, John Chilton 4, Marius van den Beek 4, Enis Afgan 1

  1. Johns Hopkins University, Baltimore, MD, USA
  2. Melbourne Bioinformatics, University of Melbourne, Victoria 3010, Australia
  3. The European Bioinformatics Institute (EMBL-EBI), Cambridgeshire, United Kingdom
  4. Penn State University, State College, PA, USA

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
Alexandru Mahmoud

Galaxy Team, Johns Hopkins University



Monday July 20, 2020 11:16 - 11:30 EDT
Galaxy

11:30 EDT

Codeathons as a tool for improving diversity in computer science 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

ALLISSA DILLMAN 1, RANA MORRIS 2, PETER COOPER 3, ERIC SAYERS 4, BART TRAWICK 5

1 Allissa Dillman, NCBI/NLM/NIH, Bethesda MD 20892 allissa.dillman@nih.gov
2 Rana Morris NCBI/NLM/NIH, Bethesda MD 20892
3 Peter Cooper NCBI/NLM/NIH, Bethesda MD 20892
4 Eric Sayers NCBI/NLM/NIH, Bethesda MD 20892
5 Bart Trawick NCBI/NLM/NIH, Bethesda MD 20892

Project Website: https://ncbi-codeathons.github.io/
Source Code: https://github.com/topics/womenled-nih-2019
License: MIT License

Women are underrepresented in computer science, accounting for only ~18% of the population
receiving degrees in this field. These numbers have been dropping since the 1980s when female
representation was at ~37%. A perceived lack of experience and few opportunities for female
mentorship are often cited as barriers to women entering computationally intensive fields. Hackathons
are one place where early career computer scientists can explore their creativity and code as part of a
team. Additionally, these events also allow the opportunity to network with others in the biological,
data, and computer science fields, improving representation throughout career stages and creating
opportunities to find novel mentors. Finally, hackathons are a great way to learn new skills, tools and
technologies on the fly from peers. However, hackathons typically also have a gender gap that reflects
the overall participation rate in computer science, with only around 20-25% of participants being
female. Our goal was to facilitate collaboration among communities in science and technology who may
often not interact and to increase the representation of women in computer science activities. To this
end, we created the women-led biodata science codeathon, an event with all-female organization and
leadership and where team projects were proposed, led, developed and presented by women. The
event itself was held May 8-10, 2019 on the National Institutes of Health main campus in Bethesda,
Maryland. We had forty-six women from 11 NIH institutes, 10 universities, two consulting firms, two
industrial companies, and a software company work together as teams on eight projects using cloud
infrastructure provided free of charge by the National Center for Biotechnology Information. The majority
of our participants were first time hackathoners and many of them cited the fact that this event was
women-led as the reason for their interest. The event was so successful that several teams continue to
collaborate on their codeathon projects, through on-going analysis, writing manuscripts, and working on
posters for upcoming conferences. Many women were asking for another iteration of the event before it
had even finished. The 2nd annual women-led BioData Science Codeathon at NIH will take place in the
fall of 2020. We are continuing to empower diverse coding, science and technologies groups with the
goal of creating more codeathons and other data and computational events that will encourage data
democratization for all.

Speakers


Monday July 20, 2020 11:30 - 11:35 EDT
BOSC

11:30 EDT

Custos: Enabling User Authentication via External Institutional Identities 🌀
➞ Abstract

Juleen Graham 1, Dannon Baker 1, Isuru Ranawaka 2, Alexandru Mahmoud 1, Terry Fleury 3, Suresh Marru 2, Marlon Pierce 2, Enis Afgan 1

  1. Johns Hopkins University, Baltimore, MD, USA
  2. Indiana University, Bloomington, IN, USA
  3. University of Illinois, Urbana, IL, USA

The presenter(s) will be available for live Q&A at the end of this session (BCC West)

Speakers
JG

Juleen Graham

Johns Hopkins University



Monday July 20, 2020 11:30 - 11:35 EDT
Galaxy

11:35 EDT

CyVerse Learning Institute’s foundational open science skills workshop 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Tyson Lee Swetnam

University of Arizona, Tucson AZ. Email: tswetnam@arizona.edu

Project Website: https://learning.cyverse.org/projects/foss-2020/en/latest/
Source Code: https://github.com/CyVerse-learning-materials/foss-2020
License: CC BY 4.0

Abstract
CyVerse is a research cyberinfrastructure funded by the National Science Foundation’s Directorate for Biological Sciences. CyVerse provides life scientists with computational infrastructure to handle big datasets and complex analyses, thus enabling data-driven discovery. Principal investigators have reported that access to computing resources is not the bottleneck to data-driven discovery; rather, the requisite skills in utilizing cyberinfrastructure and access to training are the most limiting. Our “Foundational Open Science Skills (FOSS)” was designed as a weeklong, camp-style training to address these problems. The focus of FOSS is on computational research strategies, full lifecycle data management, the FAIR data principles, collaboration skills, and using open-source software. FOSS prepares researchers to meet the growing expectations of funding agencies, publishers, and research institutions for scientific reproducibility, data accessibility, and advanced analytics. In this talk, I will discuss our lessons learned, how participants become familiar with productivity software for organizing their data science lab group, communications, and research; and how we approach teaching computational skills from laptop to cloud and high-performance computing (HPC) systems. In the last twelve months, FOSS has been taught twice to over forty early career researchers. Participants have gone on to begin their tenure-track positions, conduct funded research, write new proposals utilizing FOSS techniques, and win competitive grant awards. To contribute back to the community, we have placed our training materials online on GitHub in ReadTheDocs format, where anyone can learn from them or contribute back to the project.

Speakers
avatar for Tyson Swetnam

Tyson Swetnam

Research Assistant Professor, University of Arizona
I work for CyVerse.org. Lately, I've been developing containerized workflows for use in cyberinfrastructure in life and earth science. If you're interested in foundational open science skills or learning more about using free research computing come talk to me!



Monday July 20, 2020 11:35 - 11:40 EDT
BOSC

11:35 EDT

On-demand Galaxy with Laniakea: results and future perspectives 🌀
➞ Abstract

Tangaro Marco Antonio 1, Donvito Giacinto 2, Antonacci Marica 2, Chiara Matteo 3, Mandreoli Pietro 3, Alverà Martina 3, Pesole Graziano 1,4, Zambelli Federico 1,3

  1. Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (CNR), Bari, Italy
  2. National Institute for Nuclear Physics, Bari Section, Italy
  3. Dept. of Biosciences, University of Milan, Italy
  4. Dept. of Biosciences, Biotechnologies and Pharmacological Sciences, University of Bari, Italy

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
MA

Marco Antonio Tangaro

Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies - National Research Council, Bari, Italy



Monday July 20, 2020 11:35 - 11:40 EDT
Galaxy

11:40 EDT

HiSCiAp and Human Cell Atlas Galaxy instance: User-friendly, scalable tools and workflows for single-cell analysis 🌀
➞ Abstract

Moreno, P. 1, Huang, N. 1,2, Manning, J.R. 1, Mohammed S. 1, Solovyev A. 1, Polanski, K. 2, Chazarra, R. 1, Talavera-López, C. 1,2, Doyle, M. 3,4, Marnier, G. 1, Grüning, B. 5, Rasche, H. 5, Miao, C. 1, Bacon, W. 1, Perez-Riverol, Y. 1, Haeussler, M. 6, Brazma, A. 1, Meyer, K.B. 2, Teichmann, S. 2, Papatheodorou, I. 1

  1. EMBL-EBI
  2. Wellcome Sanger Institute
  3. Research Computing Facility, Peter MacCallum Cancer Centre, Melbourne, Victoria 3000, Australia
  4. Sir Peter MacCallum Department of Oncology, The University of Melbourne, Victoria 3010, Australia
  5. U. of Freiburg
  6. Genomics Institute, University of California at Santa Cruz, 1156 High Street, Santa Cruz 95064, USA

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
avatar for Pablo Moreno

Pablo Moreno

EMBL-EBI European Bioinformatics Institute



Monday July 20, 2020 11:40 - 11:55 EDT
Galaxy

11:40 EDT

Open Life Science: Empowering early career researchers to become open science leaders 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

By the Open Life Science team members: Bérénice Batut, Malvika Sharan, Yo Yehudi

Project Website: http://openlifesci.org/
Source Code: https://github.com/open-life-science/open-life-science.github.io 
License: CC-BY for all training and mentoring materials, CC-BY-SA for the website content

Motivation: As scientists, we are provided training and guidance in how to conduct research in the lab, design algorithms, analyse data and publish them. However, scientists are rarely expected to apply important skills such as open science principles for tooling and road mapping their projects, planning reproducible workflows, involving others in their work, and leading an inclusive community. Modern bioinformatics communities stand in the interface of computational and biological research. This interdisciplinary position requires us to develop collaborative projects by implementing such “open by design” principles in our research projects systematically -- skills that aren’t necessarily taught at university or graduate school level.

About the project: Open Life Science (OLS) is a volunteer-driven training and mentoring program aimed at empowering early career researchers and potential academic leaders to become open science ambassadors. Participants join OLS with a proposal to work on an open science project and attend a series of one-on-one mentoring calls over 16 weeks, alternating with full cohort calls that provide training on specific open science and leadership skills. OLS’s work is underpinned by a community of over 50 mentors and expert guest speakers.

Cohort calls cover a broad spectrum of topics relevant to leading an open project, ranging from open science topics, community building, project and contribution management of GitHub repositories, and caring both for yourself and others in your community. Calls are designed to be interactive and engaging, utilising a mix of Zoom’s break-out room features to facilitate group discussion, collaborative document editing, and guest speakers from academia and industry giving short talks. The program is modelled on the exact principles we teach, and hence, all materials, including syllabus, call notes, and slides, are shared under the CC-BY licence. Cohort calls are recorded and shared openly on YouTube. Third-party organisations and individuals are encouraged to fork, remix and re-use materials.

Overview of the first round: OLS’s first cohort (OLS-1), known as “Open Seeds”, was conducted from January 2020 until May 2020 with 29 project leaders working on 20 projects. Project leaders came from around the world, including the Netherlands, Spain, Norway, Japan, India, Nepal, Thailand, Kenya, Brazil, Russia, Canada, the United Kingdom, and the United States. At the end of the program, the project leaders graduate by presenting their work, sharing their mentorship experience and discussing their future plans on publicly live-streamed video calls.

In this talk, we will report important observations and outcomes from running the first cohort of our mentoring and training program. At the time of writing, OLS-1 is in final stages of wrap-up and graduation, and we aim to open applications for OLS-2 in May 2020. We will also welcome new mentors and experts, including the project leaders from OLS-1, who will be encouraged to return to join the mentor and expert teams for OLS-2.

Speakers
avatar for Bérénice Batut

Bérénice Batut

Post-doc, University of Freiburg
avatar for Yo Yehudi

Yo Yehudi

Software Developer, University of Cambridge & Open Life Science
Integrated genomic data (InterMine)
avatar for Malvika Sharan

Malvika Sharan

Senior Researcher, The Alan Turing Institute
I am a senior researcher for the Tools, Practices and Systems research programme at The Alan Turing Institute, London. With a focus on Open Research, I lead a team of community managers and co-lead The Turing Way project that aims to make data science reproducible, collaborative...



Monday July 20, 2020 11:40 - 11:55 EDT
BOSC

11:55 EDT

Q & A 🌀
Question and Answer session for the just finished talks.

Moderators
Monday July 20, 2020 11:55 - 12:00 EDT
Galaxy

11:55 EDT

Q&A 🍐
The presenter(s) will be available for live Q&A in this session.

Moderators
avatar for Yo Yehudi

Yo Yehudi

Software Developer, University of Cambridge & Open Life Science
Integrated genomic data (InterMine)

Monday July 20, 2020 11:55 - 12:00 EDT
BOSC

12:15 EDT

Sponsor Session West
Learn more about BCC2020 Sponsors. Sponsors make this event possible and affordable, and are potential partners for your research.

Moderators
avatar for Dave Clements

Dave Clements

Training and Outreach Coordinator, Galaxy Project, Johns Hopkins University

Monday July 20, 2020 12:15 - 13:00 EDT
Joint
  Meeting-West

12:20 EDT

AWS Gold Sponsor Talk: Scalable genomics data analysis in the cloud
→ Slides

Abstract:
The amount of raw genomics data is continuously growing, with some estimating that the amount of data worldwide is on the order of exabytes. Processing such mountains of FASTQs into science-ready formats like VCFs, expression matrices, etc. is no trivial task and requires workflow architectures that can scale in both performance and cost efficiency. The cloud offers practically unlimited compute capacity, elasticity, and flexibility to process enormous amounts of genomics data cost-effectively and on demand. In this talk, we’ll highlight the core patterns, architectures, and tooling used by many genomics customers who are leveraging the cloud to tackle their biggest genomics data processing challenges.

Sponsorship:
Amazon Web Services is a Gold Level sponsor of BCC2020.  Lee Pang is also giving this talk in BCC East. AWS is used in the research behind several presentations at BCC2020.

Speakers
avatar for Lee Pang

Lee Pang

Amazon Web Services
Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science developing tools ranging from high throughput pipelines for *omics data processing...




Monday July 20, 2020 12:20 - 12:40 EDT
Joint

12:40 EDT

Software Sustainability Institute Gold Sponsor Talk: Most code is cr#p and that’s okay
Abstract
The COVID-19 pandemic has shone a spotlight on research software, with everyone from scientists to sceptics scrutinising the codes and models used to inform policy and interventions. This has highlighted a gap in expectations of the quality of this software – but what is the reality and what is good enough? What should we strive for as the people using software to power our research?

In this talk, I’ll discuss some of the challenges of developing reusable research software, and why collaboration and openness are the strongest tools to improve the quality of your code.

Sponsorship:
The Software Sustainability Institute has been working for a decade in this area, encouraging and enabling researchers and research software engineers to live up to our slogan of “better software, better research”. The Software Sustainability Institute is a Gold Level sponsor of BCC2020.


Speakers
avatar for Neil Chue Hong

Neil Chue Hong

Director, Software Sustainability Institute, University of Edinburgh
Neil Chue Hong is the founding Director and PI of the Software Sustainability Institute and a Senior Research Fellow at EPCC, based at the University of Edinburgh. He graduated with an MPhys in Computational Physics, also from the University of Edinburgh. He completed an internship...



Monday July 20, 2020 12:40 - 13:00 EDT
Joint

13:00 EDT

eLife Innovation Sponsor Table
eLife works to improve research communication through open science and open technology innovation.

eLife is a non-profit organisation inspired by research funders and led by scientists. Our mission is to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science.

eLife sponsored childcare at the 2018 joint conference, and again at the 2019 Galaxy Conference. This year eLife is sponsoring closed captioning for conference talks.

Please stop by and learn more about eLife. We are located on the first floor of the Poster / Demo building.

Speakers
avatar for Emmy Tsang

Emmy Tsang

Innovation Community Manager, Delft University of Technology



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

GigaScience Sponsor Table
GigaScience is an online open access, open data, open peer-review journal published by Oxford University Press and BGI. The journal offers ‘big data’ research from the life and biomedical sciences, and on top of 'Omics research includes the growing range of work that uses difficult-to-access large-scale data, such as imaging, neuroscience, ecology, systems biology, and other new types of shareable data. GigaScience is unique in the publishing industry as it publishes all research objects (data, software tools, source code, workflows, containers and other elements related to the work underpinning the findings in the article). To promote Open Science, all published software needs to be under an OSI-license, all supporting data must be available and open, and all peer review is carried out transparently. We present workflows via our GigaGalaxy.net server, and novel work presented at the meeting utilising Galaxy is eligible for a 15% APC discount if it is submitted to our Galaxy series.

Please stop by and learn more about GigaScience. We are located on the first floor of the Poster / Demo building.

Speakers
avatar for Ken Cho

Ken Cho

Systems Programmer Analyst, GigaScience
avatar for Scott Edmunds

Scott Edmunds

Editor in Chief, GigaScience Press/BGI Hong Kong
Scott Edmunds is the Editor in Chief of GigaScience Press. With over 15 years experience in Open Access and Open Data publishing he is co-founder of CivicSight (formerly Open Data Hong Kong) and CitizenScience.Asia, and is on the Board of Directors of the Dryad Digital Repository...
avatar for Laurie Goodman

Laurie Goodman

Publishing Director, GigaScience Press
Laurie Goodman, PhD, is the Publishing Director for GigaScience Press, which publishes the international, open-science journals GigaScience and GigaByte. Both journals have won awards for Innovation in publishing. Dr. Goodman received her BS and MS from Stanford University in 1986...



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-05: Automated generation of training materials from markdown documents 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Delphine Lariviere

Delphine Lariviere

Penn State University
Post-doc in the Galaxy Team (Nekrutenko Lab). Works on bacterial genomics, assembly, RNA Seq, TnSeq. Also interested in evolution, metagenomics, epigenetics and visualisation.



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-06: Automated real-time data analysis and visualizations for the SARS-CoV-2/Covid19 portal 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Marius van den Beek

Marius van den Beek

Penn State University



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-09: BioViz Connect: Web application linking CyVerse cloud resources to genomic visualization in the Integrated Genome Browser 🍐
➞ Abstract

This poster will be presented live at BCC West.

Advances in high throughput sequencing have increased the need for tools that aid in data storage, analysis, annotation, and visualization. Many such tools are available, but their usability and accessibility vary. To make essential tools more accessible, the bioinformatics community has coalesced around the idea of using cloud-based platforms to provide access to computational power and data storage resources. CyVerse is a multi-institution project focused on supporting life science research by providing user-friendly access to national cyberinfrastructure resources, including HPC clusters and storage infrastructure. As part of this effort, CyVerse developers built the Terrain Application Programmer Interfaces (APIs), which offer programmatic access to these resources.

One important limitation of the CyVerse ecosystem, however, is that there is currently no easy way for researchers to visualize genomic data sets stored in CyVerse accounts. This is problematic because visualization is essential for all aspects of data analysis, from validating the output of algorithms to detecting biologically meaningful patterns in data.

BioViz Connect solves this problem by connecting CyVerse resources to Integrated Genome Browser, a full-featured, open source, visualization tool for genomics used by thousands of researchers worldwide. BioViz Connect uses Terrain APIs to forward data from CyVerse into IGB. The BioViz Connect interface (Figure 1) lets users annotate data files with key meta-data, notably the version of reference genomes used in data analysis. Users can also run compute-intensive visual analytics tasks and then display the results in IGB. To our knowledge, no other group has yet experimented with using Terrain for application development outside of the CyVerse team.


Speakers
avatar for Nowlan Freese

Nowlan Freese

Research Associate, UNC Charlotte



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P1-12: Codeathons as a tool for improving diversity in computer science 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P2-01: Community genome annotation integrates with Galaxy via Apollo providing greater integration and more functional annotation options 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Nathan Dunn

Nathan Dunn

Software Developer, Lawrence Berkeley National Lab



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P2-05: CrowdGO: a wisdom of the crowd-based Gene Ontology prediction tool 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
MR

Maarten Reijnders

Department of Ecology and Evolution, University of Lausanne


Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-03: EDAM: the ontology of bioinformatics operations, topics, data, and formats (update 2020) 🍐
➞ Abstract

This poster will be presented live at BCC West on Monday, and East on Tuesday.


Matúš Kalaš 1, Hervé Ménager 2, Alban Gaignard 3, Veit Schwämmle 4, Jon Ison 5, and the EDAM contributors and advisors

1. University of Bergen, Norway
2. Institut Pasteur, Paris, France
3. University of Nantes, France
4. University of Southern Denmark, Odense, Denmark
5. French Institute of Bioinformatics (ELIXIR France)

Project website: https://edamontology.org
Source code: https://github.com/edamontology/edamontology
License: CC BY-SA 4.0

EDAM is an ontology of well-established, familiar concepts that are prevalent within bioinformatics, and bioscientific data analysis in general [1,2]. The scope of EDAM includes types of data and data identifiers, data formats, operations, and topics. EDAM has a relatively simple structure, and comprises a set of concepts with terms, synonyms, definitions, relations, links, persistent identifiers, and some additional information (especially for data formats).

EDAM is developed in a participatory and transparent fashion, within a growing international community of contributors. The development of EDAM is coordinated with the development and curation of tools registries (e.g. bio.tools and BIII.eu); registries of training materials (e.g. TeSS); with packaging of open-source bioinformatics software (especially Debian Med [3]); the Common Workflow Language [4]; and other related communities and initiatives. These include the developers’ community of Galaxy [5], and collaborations with specialised networks of experts, such as within the development of EDAM-bioimaging [6]. EDAM-bioimaging is an extension of EDAM towards bioimage informatics and machine learning, where a broad group of experts in bioimaging, image analysis, and deep learning has been contributing to the common effort. The comprehensive but concise inclusion of machine learning topics is one of the new additions in 2020. The latest release of EDAM at the time of publication was version 1.24 [7], and EDAM-bioimaging version alpha06 [8].

In summary, EDAM functions as common controlled vocabulary when publishing, sharing, and integrating information about bioinformatics tools, workflows, training materials, and other resources. In addition, EDAM is also useful when choosing terminology, for data provenance, and in text mining (e.g. EDAMmap).

Poster published in F1000Research on 6 Jun 2020. https://doi.org/10.7490/f1000research.1117983.1
Video presentation: https://youtu.be/Jq16bnq8kbk


Speakers
avatar for Matúš Kalas

Matúš Kalas

Senior Engineer, University of Bergen
Attending GCC2022 virtually 🌌 Working on open science|source|society|education, EDAM ontology, ELIXIR Norway, Bio.tools, ...



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P3-06: eSPiGA: a population genomic analyses package with graphical interface 🍐
➞ Abstract

This poster will be presented live at BCC West.


Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P4-01: Functionally Assembled Terrestrial Ecosystem Simulator (FATES) with Community Land Model in Galaxy 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Anne Fouilloux

Anne Fouilloux

Research Software Engineer, University of Oslo
I am working on Galaxy Climate (development of tools, integration of climate data, training material).
HT

Hui Tang

University of Oslo, Department of Geosciences
SG

Sonya Geange

Department of Biological Sciences, University of Bergen, Norway



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P4-05: Goslin - A grammar of succinct lipid nomenclature 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
NH

Nils Hoffmann

Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V.


Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P4-09: HiSCiAp and Human Cell Atlas Galaxy instance: User-friendly, scalable tools and workflows for single-cell analysis 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Pablo Moreno

Pablo Moreno

EMBL-EBI European Bioinformatics Institute



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P4-11: Implementation of the IEEE-2791-2020 standard (BioCompute Objects) in Galaxy via workflow invocations 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Charles Hadley King

Charles Hadley King

Senior Research Associate, George Washington University



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-01: Integrating refgenie and Galaxy for reference data management: a proposal for IDC 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
IE

Ignacio Eguinoa

ELIXIR Belgium - VIB Center for Plant Systems Biology



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-03: Jasmine: Fast and accurate structural variant comparison across many individuals 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P5-08: OpenBio.eu: An extrovert bioinformatics research object repository and workflow management system 🍐
➞ Abstract

This poster will be presented live at BCC West.


Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-06: sRNAflow: a tool for analysis of small RNA-seq data 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P6-10: The cloud-native Galaxy: Galaxy on Kubernetes 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Nuwan Goonasekera

Nuwan Goonasekera

University of Melbourne
avatar for The Other Enis Afgan

The Other Enis Afgan

Research scientist, Johns Hopkins University
avatar for Alexandru Mahmoud

Alexandru Mahmoud

Galaxy Team, Johns Hopkins University



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P7-01: ViPRA-Haplo: de novo reconstruction of viral populations using paired end sequencing data 🍐
➞ Abstract

This poster will be presented live at BCC West.

Viruses replicating within a host exist as a collection of closely related genetic variants known as viral haplotypes. The diversity in a viral population, or quasispecies, is due to mutations (insertions, deletions or substitutions) or recombination events that occur during virus replication. These haplotypes differ in relative frequencies and together play an important role in the fitness and evolution of the viral population. This variation in viral sequences poses a challenge to vaccine design and drug development. We present ViPRA-Haplo, a de novo assembly algorithm for reconstructing viral haplotypes in a virus population from paired-end next generation sequencing (NGS) data. The proposed Viral Path Reconstruction Algorithm (ViPRA) generates a subset of paths from a De Bruijn graph of reads using the pairing information of reads. These paths represent contigs of the virus. The paths generated by ViPRA are an over-estimation of the possible contigs. We then propose two methods to obtain an optimal set of contigs representing the viral haplotypes. The first method uses VSEARCH to cluster the paths reconstructed by ViPRA; the centroid of each cluster represents a contig. The second method, MLEHaplo, generates a maximum likelihood estimate of the viral populations using the ViPRA paths. We evaluate and compare ViPRA-Haplo on a simulated data set, on a real HIV MiSeq data set (SRR961514) with sequencing errors, and on an emerging SARS-CoV-2 real data set (SRR10903401). In the simulated data, ViPRA-Haplo reconstructs full length viral haplotypes having a 99.7% sequence identity to the true viral haplotypes at 250x sequencing coverage. In the real NGS data, error correction software Karect is used to improve de novo assembly. The real HIV data set contains 714,994 pairs (2x250 bp) of reads that cover the five strains to 20,000x.
Our method can reconstruct contigs that cover over 90% of each strain of the reference genomes, which is higher than the benchmark tool PEHaplo. In the SARS-CoV-2 data, after filtering for SARS-CoV-2 contigs using the metagenomic classifier Centrifuge, the contigs reconstructed by our method cover over 99% of the reference genome. The comparisons on both simulated and real data show that ViPRA-Haplo outperforms the existing tools, both in achieving higher coverage of the reference genome(s) and in retaining the variation in viral sequence present naturally in the viral population.
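
To make the De Bruijn graph step in the abstract above concrete, here is a minimal Python sketch. It is not the ViPRA implementation: the function names, the toy reads, and the choice of k are all illustrative assumptions, and it enumerates every simple path rather than using read-pair information to select a subset as the abstract describes. It only shows how two haplotypes that differ by a single substitution appear as two alternative paths through the graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def enumerate_paths(graph, start, max_len):
    """Yield the sequence spelled by each simple path from `start`.

    A path ends at a node with no unvisited successors, or once it
    contains `max_len` nodes (a guard against very long walks).
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        successors = [n for n in graph.get(node, ()) if n not in path]
        if not successors or len(path) == max_len:
            # Spell the contig: first node plus the last base of each later node.
            yield path[0] + "".join(n[-1] for n in path[1:])
            continue
        for nxt in successors:
            stack.append((nxt, path + [nxt]))


# Hypothetical toy reads: two haplotypes differing at one position (T vs A).
reads = ["ACGTTGCA", "ACGATGCA"]
graph = build_de_bruijn(reads, k=4)
haplotypes = sorted(enumerate_paths(graph, "ACG", max_len=10))
print(haplotypes)
```

On real data the number of such paths grows quickly, which is consistent with the abstract's remark that the generated paths over-estimate the possible contigs and need a later clustering or likelihood-based selection step.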


Speakers
WL

Weiling Li

postdoc, Indiana University - Bloomington



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P7-03: Visualization of gene regulation in endothelial cells that are programmed to human pluripotent stem cells and differentiated to endothelial or neuronal cells 🍐
➞ Abstract

This poster will be presented live at BCC West.

Human pluripotent stem cells, derived from embryos or fetal tissue, are providing new opportunities to understand changes in gene regulation. Here we introduce a visualization tool that can be used to investigate how physical locations of genes on a chromosome relate to changes in significant gene expression by cell types. We present the results of an experiment in which induced pluripotent stem cells (iPs) were generated from human umbilical vein endothelial cells (HUVEC), and then differentiated back into endothelial cells (EC-Diff) as well as into neuronal cells (Nn-Diff). Our tool encodes significant changes in gene expression and allows for investigation of genes by their ontological classification and physical location. Observing the relationship between location, gene ontology, and expression level across cell types can assist in the identification of patterns in gene regulation changes. This novel tool can shed light on why stem cells differentiate into one cell type over another, with applications for modeling and treatment in the realms of neurodegeneration and cardiovascular disease. Our results have the potential to bridge the gap between complicated datasets resulting from experiments on cells, and biologists with the domain knowledge to properly draw conclusions about pathway activation.


Speakers

Luke Trinity

PhD Student, University of Victoria
Luke Trinity is a Washington, D.C. native. He is pursuing a PhD in Computer Science at the University of Victoria with a focus in bioinformatics. Previously, Luke completed his B.A. in Computer Science, and M.S. in Complex Systems & Data Science at the University of Vermont.


Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P7-07: Modern Data Visualization in Galaxy with Higlass 🌀
➞ Abstract

This poster will be presented live at BCC West and BCC East.

Speakers

Alex Ostrovsky

Johns Hopkins University



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

P7-09: Leveraging machine learning techniques and microbiota data for accessible and reproducible biological classification 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Jayadev Joshi

Postdoctoral researcher, Cleveland Clinic



Monday July 20, 2020 13:00 - 13:45 EDT
Joint

13:00 EDT

Poster / Demo West Session 2
The second poster and demo session of BCC2020.

Access the Poster / Demo hall through the "Go to Posters" button at the top left in the main BCC2020 Remo conference space.

Monday July 20, 2020 13:00 - 13:45 EDT
Joint

14:00 EDT

BOSC West Session 4a: Developer tools & libraries 🍐
Accepted talks and lightning talks.

Moderators

Karsten Hokamp

Bioinformatics Support, Trinity College Dublin

Monday July 20, 2020 14:00 - 14:30 EDT
BOSC
  Meeting-West

14:00 EDT

Galaxy West Session 4: The big picture 🌀
Accepted talks and lightning talks.

Moderators
Monday July 20, 2020 14:00 - 15:00 EDT
Galaxy
  Meeting-West

14:01 EDT

Applying Pulsar: a democratised national and international remote compute network for Galaxy 🌀
➞ Abstract

Gianmauro Cuccuru 1, Marco Antonio Tangaro 4, Björn Grüning 1, Nate Coraor 3, John Chilton 3, Simon Gladman 2

  1. UseGalaxy.eu, Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University Freiburg, Germany; 
  2. Galaxy Australia, Melbourne Bioinformatics, University of Melbourne, Australia; 
  3. Galaxy Project, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA; 
  4. Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies - National Research Council (IBIOM-CNR), Bari, Italy

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers

Gianmauro Cuccuru

University of Freiburg



Monday July 20, 2020 14:01 - 14:15 EDT
Galaxy

14:01 EDT

Shesmu: A bioinformatics orchestration tool 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Andre P. Masella 1, Heather E. Armstrong 2, Iain Bancarz 2, Dillan J. Cooke 2, Michael Laszloffy 2,
Angie Mosquera 2, Alexis V. Varsava 2, Morgan Taschuk 3

1 Ontario Institute for Cancer Research, Toronto, Canada. Email: andre.masella@oicr.on.ca
2 Ontario Institute for Cancer Research, Toronto, Canada.
3 Ontario Institute for Cancer Research, Toronto, Canada. Email: morgan.taschuk@oicr.on.ca

Project Website: https://oicr-gsi.github.io/shesmu
Source Code: https://github.com/oicr-gsi/shesmu
License: MIT License


Main Text of Abstract

In the ten years that the Genome Sequence Informatics group has existed at OICR, our production
infrastructure and automation grew from a few small scripts to an unmanageable collection of
scripts in various languages, cron jobs, and servers. Each additional workflow brings along a new
collection of scripts that must orchestrate running that workflow. As we now have over 30
workflows, this creates additional load even though most jobs are doing similar work. Data that
fails to be picked up by the next workflow is difficult to track, and errors in scripts cause delays in
the whole pipeline. Because each system was developed independently, debugging and logging are
inconsistent, if available at all. Many of these pieces of software started as clones, but divergence
over time makes it hard to apply bug fixes consistently across them. We created Shesmu as a way to
consolidate and simplify orchestration: workflow scheduling, ticketing, data release, and QC
validation.

Shesmu ingests data from many user-specified tabular data sources and feeds it to olives, small SQL-
like programs that filter and group data, in order to produce actions. Actions communicate with
external systems to accomplish their task given the parameters provided by the olive. The standard
distribution includes plugins to integrate with several external systems, including Atlassian's JIRA,
remote servers via SSH, MongoDB, Prometheus, and GitHub as well as our internally developed
Niassa workflow engine, Pinery LIMS interface, and Guanyin reporting system. We have used
Shesmu to automate running Niassa and WDL workflows, generating reports, updating our QC data
warehouse, notifying operators about invalid data, requesting the lab enter missing required data,
and informing the lab of the current analysis progress.
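The filter, group, and act pattern described above can be illustrated with a minimal Python sketch. This is an analogy only, not Shesmu's olive language or API: the `Action` class, the `run_olive` function, the row fields, and the file paths are all hypothetical names invented for the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical request to an external system (e.g. launch a workflow)."""
    kind: str
    parameters: tuple


def run_olive(rows, predicate, group_key, action_kind):
    """Filter tabular rows, group the survivors, and emit one action per group."""
    groups = defaultdict(list)
    for row in rows:
        if predicate(row):
            groups[group_key(row)].append(row["path"])
    # Sort groups and paths so the same input always yields the same actions.
    return [
        Action(action_kind, (key, tuple(sorted(paths))))
        for key, paths in sorted(groups.items())
    ]


# Illustrative input rows; a real deployment would read these from data sources.
rows = [
    {"sample": "S1", "type": "fastq", "path": "/data/S1_R1.fastq.gz"},
    {"sample": "S1", "type": "fastq", "path": "/data/S1_R2.fastq.gz"},
    {"sample": "S2", "type": "bam", "path": "/data/S2.bam"},
    {"sample": "S3", "type": "fastq", "path": "/data/S3_R1.fastq.gz"},
]

actions = run_olive(
    rows,
    predicate=lambda r: r["type"] == "fastq",
    group_key=lambda r: r["sample"],
    action_kind="align",
)
for action in actions:
    print(action)
```

Because the output depends only on the input rows, re-running the same sketch over unchanged data produces the same set of actions, which is one plausible way to avoid launching duplicate work.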

Shesmu has slashed automation time from over two weeks to a few days. Additionally, Shesmu runs
faster, provides better feedback to developers, and allows easier control for operators. The reduced
development time for olives has also reduced the need for operators to run workflows manually; it
is faster and easier to fix an olive and redeploy it than it is to run the workflow manually.
Furthermore, while writing workflow launchers required experience with Java and the API of the
workflow engine, the simplified domain language for writing olives has increased the number of
developers by lowering the barrier to entry. Shesmu's memory requirements are consistent and
the system scales very well; we went from 10 olives in March 2019 to 102 olives in March 2020 with
very little change in resource usage.

Shesmu has replaced many of our launchers and data processing cron jobs with olives that run
faster and provide a better experience for both our operators and developers.

Speakers

Andre Masella

Sr. Software Developer, Ontario Institute for Cancer Research
I'm a programmer at Genome Sequence Informatics at the Ontario Institute for Cancer Research, supporting pipeline infrastructure and maintenance projects for our automated analysis.



Monday July 20, 2020 14:01 - 14:15 EDT
BOSC

14:15 EDT

SIMD Everywhere: portable implementations of SIMD intrinsics 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Evan Nemerson 1, Hidayat Khan 2, Himanshi Mathur 3, Jun Aruga 4, Michael R. Crusoe 5

1 Unaffiliated; San Diego, CA, USA. Email: evan@nemerson.com
2 IIIT Lucknow; Bhopal, India.
3 IIIT Delhi; Delhi, India.
4 Red Hat; Brno, Czech Republic.
5 VU Amsterdam, DTL Projects, Debian, CWL; Berlin, Germany.

Project Website & Source Code: https://github.com/nemequ/simde/
License: MIT License

SIMD Everywhere (SIMDe) is a header-only C/C++ library which provides fast, portable
implementations of SIMD intrinsics on platforms which do not natively support them, such
as calling AVX functions on ARM, AltiVec/VMX, POWER, WebAssembly, or less powerful x86
processors.
The use of SIMD intrinsics provides a significant opportunity for optimization, but
the technique has traditionally been an "either/or" situation in bioinformatics: there was
either the highly optimized path plus a completely unoptimized code path, or some CPU
architectures were completely unsupported. The SIMD Everywhere library allows you to
target these advanced extensions while retaining support for machines which don't
support them. If the platform supports the intrinsics natively there is zero performance
penalty. On other platforms the SIMD Everywhere library will attempt to use intrinsics
which are available, such as those from another SIMD instruction set. If that doesn't work,
the SIMD Everywhere library will attempt to use compiler-specific functionality such as
__builtin_shuffle, __builtin_shufflevector, and __builtin_convertvector or autovectorization
hints such as OpenMP SIMD, Cilk Plus, GCC's ivdep, or clang's loop pragma before
falling back on pure, standards-compliant C or C++.
With ARM64 HPC systems becoming more common, cost-effective ARM64 options
on AWS, and persistent rumors of forthcoming ARM laptops from Apple, ARM support is
becoming much more important to bioinformatics. Using the SIMD Everywhere library to
port to ARM, or any other architecture, often requires little more than including the
relevant header and defining a single macro. Furthermore, native calls to SIMD Everywhere
library functions can be mixed with native intrinsics, allowing you to add
manually-optimized implementations for particularly hot and/or poorly performing code
as necessary, without the need to port the entire project all at once.
SIMDe currently contains complete portable implementations for MMX, SSE, SSE2,
SSE3, SSSE3, SSE4.1, AVX, FMA, and the SIMD GFNI functions. Ongoing work includes AVX2,
SSE4.2, various AVX-512 extensions, and SVML. Two OBF-sponsored Google Summer of
Code students will be working on the project this summer; Hidayat Khan will be focusing
on AVX2 and SSE2, and Himanshi Mathur will be focusing on SVML.
A number of bioinformatics packages are already using SIMDe, either natively or via
a patch from Debian: examl, last-align, python-skbio, minimap2, pbcopper, hisat2, vg,
fermi-lite, bowtie2, bwa, raxml, kalign, fasta3, plink2, mmseqs2.

Speakers


Monday July 20, 2020 14:15 - 14:20 EDT
BOSC

14:15 EDT

Galaxy Community Update 🌀
➞ Abstract

The Galaxy Community 1

  1. A cast of thousands.

An update on what's happening with the Galaxy Community, around the world.

Presenter(s) will be available for live Q&A at the end of this session (BCC West).


Speakers

Simon Gladman

University of Melbourne

Gareth Price

Head of Computational Biology, QCIF Facility for Advanced Bioinformatics

Bjorn Gruning

University of Freiburg

Michael Schatz

Johns Hopkins University

Anton Nekrutenko

Penn State University

Jeremy Goecks

Associate Professor of Biomedical Engineering and Computational Biology, Oregon Health & Science University

The Other Enis Afgan

Research scientist, Johns Hopkins University

Daniel Blankenberg

Assistant Professor, Genomic Medicine Institute, Cleveland Clinic Lerner Research Institute


Monday July 20, 2020 14:15 - 14:50 EDT
Galaxy

14:20 EDT

Biopython Project Update 2020 🍐
→ Abstract, Slides, Video

The presenter(s) will be available for live Q&A in this session (BCC West).

Peter Cock 1 and the Biopython Contributors 2

1 Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, UK
2 See contributor listing on GitHub.

Website: http://biopython.org
Repository: https://github.com/biopython/biopython
License: Biopython License Agreement (BSD like, see http://www.biopython.org/DIST/LICENSE)

The Biopython Project is a long-running distributed collaborative effort, supported by the Open
Bioinformatics Foundation, which develops a freely available Python library for biological computation [1]. This
talk will look ahead to the year to come, and give a summary of the project news since the 1.74 release in
July 2019, and the talk at BOSC 2019.
There have been three releases: Biopython 1.75 (November 2019), Biopython 1.76 (December 2019), and
Biopython 1.77 (expected May/June 2020). This year saw the adoption of the black Python coding style,
our final release to support Python 2, and substantial code cleanup to focus on Python 3 only.
In 2017 we started a re-licensing plan, to transition away from our liberal but unique Biopython License
Agreement to the similar but very widely used 3-Clause BSD License. We are reviewing the code base
authorship file-by-file, to gradually dual license the entire project. All new contributions are dual licensed,
and currently over 75% of the Python and C files in the main library have been dual licensed.
Another important effort has been improving the unit test coverage, which can be viewed at CodeCov.io.
Sadly, this has stalled at about 85% (excluding online tests) for some time.
We are using GitHub-integrated continuous integration testing on Linux (using TravisCI) and Windows
(using AppVeyor), including enforcing the Python PEP8, PEP257 and black coding style guidelines. We
recommend a simple git pre-commit hook using flake8 for our contributors, which aims to reduce the
human time costs in writing compliant code.
Finally, since our last update talk in July 2019, Biopython has had 37 named contributors including
15 newcomers. This reflects our policy of trying to encourage even small contributions. Our total named
contributor count is now at 275 since the project began, over twenty years ago.

References
[1] Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T.,
Kauff, F., Wilczynski, B., de Hoon, M.J. (2009) Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. doi:10.1093/bioinformatics/btp163


Speakers

Peter Cock

The James Hutton Institute
Bioinformatician at the James Hutton Institute, a member of the BOSC organizing committee, treasurer of the Open Bioinformatics Foundation, and a core developer on the Biopython project.



Monday July 20, 2020 14:20 - 14:25 EDT
BOSC

14:25 EDT

Q&A for session B4a 🍐
The presenter(s) will be available for live Q&A in this session.

Moderators

Karsten Hokamp

Bioinformatics Support, Trinity College Dublin

Monday July 20, 2020 14:25 - 14:30 EDT
BOSC

14:29 EDT

BOSC West Session 4b: Visualization 🍐
Accepted talks and lightning talks.

Moderators

Karsten Hokamp

Bioinformatics Support, Trinity College Dublin

Monday July 20, 2020 14:29 - 14:44 EDT
BOSC
  Meeting-West

14:30 EDT

Introducing App Store for IGB, a site for sharing and installing extensions for Integrated Genome Browser from BioViz.org 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Ann Loraine 1, Riddhi Patil, Sameer Shanbhag, Noor Zahara, Sneha Ramesh Watharkar, Prinav Tambvekar, Kiran Korey, Nowlan Freese

1 UNC Charlotte, Charlotte NC. Email: aloraine@uncc.edu

Project Website(s): https://bioviz.org and https://apps.bioviz.org
Source Code: https://bitbucket.org/lorainelab/appstore 
License: Common Public License Version 1.0

Integrated Genome Browser (IGB, pronounced "ig-bee") is an open source desktop genome browser used by thousands of scientists worldwide to visually analyze genomic data sets. Since 2012, IGB has supported adding new functionality to IGB as pluggable IGB "Apps," implemented as OSGi bundles that IGB's internal OSGi framework dynamically loads or unloads on demand. This works because IGB is implemented as a collection of OSGi bundles running within an OSGi framework that manages bundle lifecycle and bundle interactions with each other. In addition, the framework enables individual bundles to provide services to each other. For example, to add a menu item and its associated functionality to IGB, an IGB App need only implement a "MenuProvider" service, as defined by an interface of the same name within the IGB code base. When a user installs the App, the framework notifies IGB's internal MenuProvider consumers, which update the user interface to show the newly added menu item. Thanks to this service-focused architecture, developers can add new functionality without needing to know deep details of the IGB code base.

However, developers and users need a way to share their work and learn about available Apps for IGB. To meet this need, we created a new "App Store" Web site where developers can upload their Apps, create and share documentation, and release new versions to users (Figure 1). To implement the App Store for IGB, we forked the open source Cytoscape App Store code and added new features and improvements. (Cytoscape 3.0 is also an OSGi-based application.) Major changes to the code base included transforming App Store into an OSGi bundle repository as per the OSGi Repository 1.0 specification and enabling App Store to use Amazon Web Services resources for storage and deployment. To enable users to install and update Apps from App Web pages within App Store, we added an App Store-specific REST endpoint to IGB itself. JavaScript code running within App Store Web pages uses the endpoint to find out if IGB is running, which version (if any) of the App is installed in IGB, and whether or not the installed version has a new version available in App Store.
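The three checks described above (is IGB running, which version is installed, is a newer version available) amount to a small decision procedure. The sketch below shows one way that logic could look. It is written in Python rather than the JavaScript the App Store uses, and it assumes dotted numeric versions in the OSGi major.minor.micro style. The function names and the returned labels are hypothetical, and the endpoint's real request and response format is not described in the abstract.

```python
def parse_version(text):
    """Parse 'major.minor.micro[.qualifier]' into a tuple of three integers.

    Missing numeric parts default to 0; any qualifier is ignored in this sketch.
    """
    numeric = [int(part) for part in text.split(".")[:3]]
    while len(numeric) < 3:
        numeric.append(0)
    return tuple(numeric)


def app_status(igb_running, installed, latest):
    """Decide what an App page could offer the user (labels are hypothetical)."""
    if not igb_running:
        return "start IGB"
    if installed is None:
        return "install"
    if parse_version(installed) < parse_version(latest):
        return "update"
    return "installed"


print(app_status(True, "1.2.0", "1.10.0"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.10.0" sorts before "1.2.0", which would hide an available update.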

Although IGB App Store is a fork of Cytoscape App Store, our changes are too extensive to permit a simple merge into the upstream code base. Nonetheless, ideas implemented in the App Store for IGB can translate to other Java programs that use OSGi to implement pluggable Apps, including Cytoscape.

Speakers

Ann Loraine

Professor, UNC Charlotte
I develop Integrated Genome Browser that integrates with Galaxy using the external viewer API. I'm interested in building more connections between visualization tools like IGB and Galaxy using APIs.



Monday July 20, 2020 14:30 - 14:35 EDT
BOSC

14:35 EDT

pyGenomeTracks: Reproducible plots for multivariate genomic data sets 🍐
→ Abstract


The presenter will be available for live Q&A in sessions of both hemispheres and is presenting a poster in session 1 in both hemispheres.

Lucille Lopez-Delisle 1, Leily Rabbani 2, Joachim Wolff 3, Vivek Bhardwaj 2, Rolf Backofen 3,
Bjoern Gruening 3, Fidel Ramirez 2,*, Thomas Manke 2

1 EPFL SV ISREC UPDUB, 1015 Lausanne, Switzerland: lucille.delisle@epfl.ch
2 Max Planck Institute of Immunobiology and Epigenetics, Stübeweg 51, 79108 Freiburg im
Breisgau, Germany
3 Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany

Project Website: https://github.com/deeptools/pyGenomeTracks
Source Code: https://github.com/deeptools/pyGenomeTracks
License: GPLv3


Main Text of Abstract

Generating publication-ready plots to display multiple genomic tracks can pose a serious challenge.
Making desirable and accurate figures requires considerable effort. This is usually done by hand or
by using vector graphics software. pyGenomeTracks (PGT) is a modular plotting tool that easily
combines multiple tracks and plots them in a highly customizable, publication-ready layout. It can
generate output in the variety of formats required by many journals. Importantly, PGT enables
reproducible and standardized generation of images, which can be integrated into more general
workflows.

Speakers

Lucille Delisle

Post-doc, EPFL SV ISREC UPDUB
Hi, I am a Post-doc in the Denis Duboule lab working on gene regulation during development. For the scientific part, I analyzed various NGS methods including Hi-C, ATAC-seq, CUT&RUN. I recently developed a new method for single-cell RNA-seq, named baredSC. For the galaxy part, I develop...



Monday July 20, 2020 14:35 - 14:40 EDT
BOSC

14:40 EDT

MultiQC updates: Visualising results from common bioinformatics tools 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (both hemispheres).

Philip Ewels

Science for Life Laboratory (SciLifeLab), DBB, Stockholm University, Stockholm, Sweden. Email: phil.ewels@scilifelab.se

Project Website: http://multiqc.info/
Source Code: https://github.com/ewels/MultiQC
License: GNU GPLv3


Main Text of Abstract

MultiQC is a reporting tool that can parse log files from other bioinformatics software, summarising key data in graphs and tables across multiple samples and multiple tools in a single report. It’s designed to run at the end of an analysis pipeline to give an overview of the entire run.

In 2017 I presented MultiQC at BOSC – at the time it was still a relatively new tool and we did a lot of great hacking at the CodeFest afterwards. Since then the number of supported bioinformatics tools has exploded from nearly 40 to nearly 100. The MultiQC community has also grown, with over 100 different people contributing code and hundreds of pull requests being opened.

As part of this growth, the paradigm of tool support has shifted. In the early days I would write nearly all tool modules to add support. As awareness of MultiQC has grown, tool authors are increasingly opting to simply add support for MultiQC instead of writing their own custom reporting tool. The ability to add “custom content” in the form of script outputs has made for a versatile tool that can cope with multiple data types and virtually any project scale.

In this talk I will describe some of the lesser-known core features that have been added to MultiQC in recent years, as well as pointing to some new ideas that are just around the corner. I’ll also cover how the MultiQC community has grown and how I manage the large number of queries and contributions as the sole maintainer.



Speakers

Phil Ewels

Bioinformatics Lead, Science for Life Laboratory
Bioinformatician doing research into next-generation sequencing applications. Lead for Bioinformatics development at the National Genomics Infrastructure in Sweden, part of SciLifeLab. Projects: MultiQC, nf-core, SRA-Explorer, QC Fail, Cluster Flow.



Monday July 20, 2020 14:40 - 14:45 EDT
BOSC

14:45 EDT

The Xena Geneset Viewer provides a visual comparison of pathways over public genomic cancer data 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Nathan Dunn 1, Brian Craft 2, Mary Goldman 2, Jing Zhu 2

1 Lawrence Berkeley National Lab, Berkeley, CA. Email: nathandunn@lbl.gov
2 University of California Santa Cruz, Santa Cruz, CA

Live Demo: https://xenageneset.berkeleybop.io/xena/
Source Code: https://github.com/ucscXena/XenaGoWidget
Project Website: https://xena.ucsc.edu/welcome-to-ucsc-xena/
License: BSD-3-Clause (https://github.com/ucscXena/XenaGoWidget/blob/develop/LICENSE)

The UCSC Xena platform focuses on the visualization of cancer genomics data
[https://doi.org/10.1101/326470]. It provides users with a view of large public cancer genomics
datasets such as TCGA and supports most types of functional genomics data. It uses the Xena Visual
Spreadsheet to let users interact with the data by providing powerful filtering, subgrouping, and statistical
analyses. However, the spreadsheet performs analysis only on a per-gene basis, making it difficult to
visually compare sets of genes or pathways at once.
Here, we introduce the Xena GeneSet Viewer, which enables comparison of gene sets between cohorts, or
across varying sample sets from within the same cohort. The Xena GeneSet Viewer can compare activity
of gene expression, copy number variants, and somatic mutation across most of The Cancer Genome
Atlas (TCGA) public data set.


Speakers

Nathan Dunn

Software Developer, Lawrence Berkeley National Lab



Monday July 20, 2020 14:45 - 14:50 EDT
BOSC

14:50 EDT

New features for GRNsight: a web application for visualizing models of small- to medium-scale gene regulatory networks 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Kam D. Dahlquist 1, John David N. Dionisio 2, Mihir Samdarshi 1, Nicole A. Anguiano 2, Anindita
Varshneya 1, Eileen J. Choe 2, Yeon-Soo Shin 2, Alexia M. Filler 2, John L. Lopez 2, Edward B. Bachoura 2,
Justin Kyle T. Torres 2, Kevin B. Patterson 2, Ona O. Igbinedion 2

1 Department of Biology, Loyola Marymount University, 1 LMU Drive, Los Angeles, CA 90045 USA.
Email: kdahlquist@lmu.edu
2 Department of Computer Science, Loyola Marymount University, 1 LMU Drive, Los Angeles, CA
90045 USA. Email: dondi@lmu.edu

Project Website: https://dondi.github.io/GRNsight/
Source Code: https://github.com/dondi/GRNsight
License: BSD License

GRNsight is an open source web application for visualizing small- to medium-scale models of gene
regulatory networks (GRNs; Dahlquist et al. 2016, https://doi.org/10.7717/peerj-cs.85). It was
originally conceived of as a companion application to GRNmap (Gene Regulatory Network Modeling
And Parameter estimation), an open source MATLAB software package that uses ordinary
differential equations to model the dynamics of medium-scale GRNs
(http://kdahlquist.github.io/GRNmap/). GRNmap uses a penalized least squares approach to
estimate mRNA production rates, expression thresholds, and regulatory weights for each
transcription factor in the network based on user-provided gene expression data and mRNA
degradation rates. GRNsight accepts GRNmap- or user-generated Excel workbooks containing an
adjacency matrix representation of the GRN, as well as SIF and GraphML files, and automatically
lays out the graph of the GRN model. It is written in JavaScript, with diagrams facilitated by D3.js.
Node.js and the Express framework handle server-side functions. GRNsight's diagrams are based on
D3.js's force graph layout algorithm, which was then extensively customized. GRNsight uses pointed
and blunt arrowheads, and colors the edges and adjusts their thicknesses based on the sign
(activation or repression) and magnitude of the GRNmap weight parameter.
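As a rough illustration of how an adjacency matrix can be turned into styled edges, the sketch below derives an arrowhead, a relative thickness, and a style for each nonzero weight. It is written in Python, not the JavaScript and D3.js that GRNsight uses, and it is not GRNsight's code. It assumes that columns are regulators and rows are targets, that positive weights mean activation (pointed arrowhead) and negative weights mean repression (blunt arrowhead), and that thickness is normalized by the largest absolute weight unless a factor is supplied. The function name, the default gray threshold of 0.05, and the example weights are hypothetical.

```python
def edges_from_adjacency(genes, matrix, normalization=None, gray_threshold=0.05):
    """Convert a weighted adjacency matrix into a list of styled edges.

    Assumed convention: matrix[target][regulator] holds the regulatory weight.
    """
    nonzero = [abs(w) for row in matrix for w in row if w != 0]
    if not nonzero:
        return []
    factor = normalization if normalization is not None else max(nonzero)
    if factor <= 0:
        raise ValueError("normalization factor must be positive")

    edges = []
    for t, target in enumerate(genes):
        for r, regulator in enumerate(genes):
            weight = matrix[t][r]
            if weight == 0:
                continue
            # Relative magnitude in [0, 1]; weights above the factor are capped.
            magnitude = min(abs(weight) / factor, 1.0)
            if magnitude < gray_threshold:
                style = "gray"  # too weak to color by sign
            elif weight > 0:
                style = "activation"
            else:
                style = "repression"
            edges.append({
                "source": regulator,
                "target": target,
                "weight": weight,
                "thickness": round(magnitude, 3),
                "arrowhead": "pointed" if weight > 0 else "blunt",
                "style": style,
            })
    return edges


# Illustrative three-gene network; the weights are made up for this example.
genes = ["ACE2", "CIN5", "ZAP1"]
matrix = [
    [0.0, 0.5, 0.0],
    [-1.0, 0.0, 0.02],
    [0.0, 0.0, 0.0],
]
edges = edges_from_adjacency(genes, matrix)
for edge in edges:
    print(edge)
```

Passing the same explicit `normalization` value for two different networks keeps their edge thicknesses on a common scale, which is the kind of between-graph comparison a user-set normalization factor is meant to support.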

Since the last presentation of GRNsight at BOSC in 2016, the code has been refactored to the model-
view-controller paradigm, a number of new features have been developed to enhance data
visualization, and the user interface has been redesigned to accommodate the new feature set. A
grid layout for nodes has been added. The viewport is now independent of the underlying bounding
box of the graph, and users can resize the viewport, move the graph within it, and zoom in and out.
The display of edge weight numerical values can be toggled on and off, and the user can manually
set the edge weight normalization factor and gray color threshold to facilitate comparison between
graphs. A color blind-friendly palette has been adopted. Nodes can now be colored, displaying a
mini "heat map" where each vertical stripe represents a timepoint in the user-inputted expression
data. In beta, a backend PostgreSQL database containing four public yeast datasets allows users to
color nodes without having to upload their own data. The database is hosted by the Amazon Web
Services Relational Database Service from which GRNsight performs dynamic queries in response
to user requests. We also created pop-up gene information webpages. When a user right-clicks on a
node, a webpage appears that is dynamically populated with data from the JASPAR, NCBI Gene,
UniProt, Ensembl, and Saccharomyces Genome Databases. These pages were initially implemented for
budding yeast; five other species are available in beta. GRNsight is developed by undergraduates with faculty
oversight. The latter two features were prototyped as part of an open source/open science
pedagogy in a Biological Databases course and completed as senior capstone projects (e.g., see
https://xmlpipedb.cs.lmu.edu/biodb/fall2017/index.php/Main_Page).

Speakers

Kam D. Dahlquist

Loyola Marymount University



Monday July 20, 2020 14:50 - 14:55 EDT
BOSC

14:50 EDT

Q & A 🌀
Question and Answer session for the just-finished talks.

Moderators
Monday July 20, 2020 14:50 - 15:00 EDT
Galaxy

14:55 EDT

Q&A for session B4b 🍐
The presenter(s) will be available for live Q&A in this session.

Moderators

Karsten Hokamp

Bioinformatics Support, Trinity College Dublin

Monday July 20, 2020 14:55 - 15:00 EDT
BOSC
 
Tuesday, July 21
 

10:00 EDT

BOSC West Session 5: Workflow management systems 🍐
Accepted talks and lightning talks.

Moderators

Peter Cock

The James Hutton Institute
Bioinformatician at the James Hutton Institute, a member of the BOSC organizing committee, treasurer of the Open Bioinformatics Foundation, and a core developer on the Biopython project.

Tuesday July 21, 2020 10:00 - 11:15 EDT
BOSC
  Meeting-West

10:00 EDT

Galaxy West Session 5: Galaxy beyond genomics 🌀
Accepted talks and lightning talks.

Moderators

Yvan Le Bras

Research engineer, French National Museum of Natural History

Tuesday July 21, 2020 10:00 - 11:15 EDT
Galaxy
  Meeting-West

10:00 EDT

BCC2020 Conference Day 3: West
Keynotes, accepted talks, posters, demos, and networking in the West.

Tuesday July 21, 2020 10:00 - 15:00 EDT
Joint

10:01 EDT

Evolution of the Nextflow workflow management system 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Evan Floden 1, Kevin Sayers 1, Paolo Di Tommaso 1,2

1 Seqera Labs, Barcelona, Spain.
2 Comparative Bioinformatics, Centre for Genomic Regulation (CRG), Barcelona, Spain
Email: paolo@seqera.io

Project Website: http://nextflow.io
Source Code: https://github.com/nextflow-io/nextflow
License: Apache 2.0


Main Text of Abstract

Nextflow is a popular open source workflow management system for the development and
deployment of data analysis pipelines. It simplifies the use of multi-scale containers and offers
seamless integration with commonly used batch schedulers as well as built-in support for
cloud computing services such as AWS Batch and Google Cloud Life Sciences.

Over the years, it has evolved to address the growing requirements from the community and
the need to manage the increasing complexity of -omics data analysis pipelines. One of the most
significant changes during the past year was the introduction of an important revision of the
Nextflow syntax termed DSL2, which represents a major shift in the capabilities of the language.
A key component of DSL2 is the ability to modularize workflows. This feature enables the reuse
of workflow components, which reduces development time, and makes testing easier for more robust
workflows, all while continuing to use the dataflow programming paradigm that characterizes
the technology. This presentation introduces the key features of DSL2 and the advantages it
provides to workflow developers.

Another advance in the Nextflow ecosystem has been the development of Nextflow Tower by
Seqera Labs. This is an open-source web application for the management of Nextflow pipelines.
It provides an extensive overview of resource utilization and enables sharing and collaboration.
Nextflow has integrated support for Tower and provides a streamlined configuration to push
workflow data.

The Nextflow userbase has also expanded rapidly in the past year, in part due to the continued
development of high-quality, open source workflows from the https://nf-co.re community. This
valuable addition has not only accelerated development but also enriched the Nextflow
ecosystem into a community of world-class scientists collaborating on a diverse set of essential
problems.

Speakers

Evan Floden

Seqera Labs



Tuesday July 21, 2020 10:01 - 10:15 EDT
BOSC

10:02 EDT

Producing biodiversity indicators from citizen science projects: update of birds and bats monitoring schemes on Galaxy-E 🌀
➞ Abstract

Romain Lorrilière 1, Benjamin Yguel 2, Alan Amossé 3, Coline Royaux 4, Yves Bas 5, Yvan Le Bras 6

  1. Centre d’Ecologie et des Sciences de la Conservation, Muséum national d’Histoire naturelle - Sorbonne Université - CNRS, Paris, France.
  2. Centre d’Ecologie et des Sciences de la Conservation, Muséum national d’Histoire naturelle - Sorbonne Université - CNRS, Paris, France.
  3. PNDB research e-infrastructure, UMS PatriNat, Museum National d’Histoire Naturelle (MNHN), Concarneau, France.
  4. PNDB research e-infrastructure, UMS PatriNat, Museum National d’Histoire Naturelle (MNHN), Concarneau, France.
  5. Centre d’Ecologie et des Sciences de la Conservation, Muséum national d’Histoire naturelle - Sorbonne Université - CNRS, Paris, France.
  6. PNDB research e-infrastructure, UMS PatriNat, Museum National d’Histoire Naturelle (MNHN), Concarneau, France.

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
CR

Coline Royaux

French national museum of natural history (MNHN)



Tuesday July 21, 2020 10:02 - 10:15 EDT
Galaxy

10:15 EDT

Essential biodiversity variables on Galaxy: implementing the PAMPA application 🌀
➞ Abstract

Coline Royaux 1, Dominique Pelletier 2, Jean-Baptiste Mihoub 3, Yvan Le Bras 4

  1. PNDB research e-infrastructure, UMS PatriNat, Museum National d’Histoire Naturelle (MNHN), Concarneau, France.
  2. Institut français de recherche pour l'exploitation de la mer (Ifremer), Nantes, France.
  3. Centre d’Ecologie et des Sciences de la Conservation, Muséum national d’Histoire naturelle - Sorbonne Université -CNRS, Paris, France
  4. PNDB research e-infrastructure, UMS PatriNat, Museum National d’Histoire Naturelle (MNHN), Concarneau, France.

The presenter(s) will be available for live Q&A at the end of this session (BCC West)

Speakers
CR

Coline Royaux

French national museum of natural history (MNHN)



Tuesday July 21, 2020 10:15 - 10:20 EDT
Galaxy

10:15 EDT

What’s new with nf-core: community-built bioinformatics pipelines 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (both hemispheres).

Philip Ewels 1, Alexander Peltzer 2,3, Harshil Patel 4, Gisela Gabernet 3, Maxime Ulysse Garcia 5, Olga Botvinnik 6, Sven Fillinger 4, Johannes Alneberg 7, Sven Nahnsen 2 and the nf-core & Nextflow communities.

1 Science for Life Laboratory (SciLifeLab), DBB, Stockholm University, Stockholm, Sweden. Email: phil.ewels@scilifelab.se
2 Translational Medicine and Clinical Pharmacology, Boehringer Ingelheim Pharma GmbH & CO. KG, Germany
3 Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
4 Bioinformatics and Biostatistics, The Francis Crick Institute, London, United Kingdom
5 Department of Oncology, Karolinska Institute, Stockholm, Sweden
6 Data Sciences Platform, Chan Zuckerberg Biohub, San Francisco, California, USA
7 Science for Life Laboratory (SciLifeLab), KTH, School of Biotechnology, Division of Gene Technology, Sweden

Project Website: https://nf-co.re/
Source Code: https://github.com/nf-core/ 
License: MIT License


Main Text of Abstract

Analysis pipelines and computational workflows are increasingly becoming a core component in life-science research. Standardisation of software packaging (bioconda, docker, singularity) can now come together with workflow managers and languages (Nextflow, Snakemake, CWL, WDL) to facilitate analysis workflows that can be truly portable across virtually any computational infrastructure (server, HPC, cloud).

Over the past two years nf-core has grown to be a truly global collaborative effort to collect gold-standard workflows built using Nextflow. All pipelines adhere to strict guidelines and are built using a common template, with consistent usage patterns and best-in-class testing and support.

The year since Alex Peltzer’s presentation at BOSC 2019 has been one of rapid growth and development. In this talk I will describe some of the major developments, including:

• Nearly double the community size, with over 1000 Twitter followers and 500 GitHub contributors at time of writing (see https://nf-co.re/stats for details)
• Growth to 43 pipelines (11 new since BOSC 2019, with 8 newly stable)
• Publication of a manuscript in Nature Biotechnology describing the community
• New user-friendly tools to launch pipelines, using JSON schema describing parameters
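
As a rough illustration of the last point in the list above, the following Python sketch shows how a JSON schema describing parameters could drive a launch step: defaults are filled in, and required keys and types are checked before anything runs. This is not nf-core's tooling. The schema contents, the `build_params` helper, and the parameter names are all hypothetical, and only the standard library is used.

```python
import json

# Hypothetical parameter schema, written in JSON Schema style.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["input"],
  "properties": {
    "input": {"type": "string", "description": "Path to a sample sheet"},
    "genome": {"type": "string", "default": "GRCh38"},
    "max_cpus": {"type": "integer", "default": 4}
  }
}
""")

# Simplified mapping: only the two JSON types used in this sketch are handled.
JSON_TYPES = {"string": str, "integer": int}


def build_params(schema, user_values):
    """Merge user values with schema defaults, then check keys and types."""
    properties = schema["properties"]
    # Start from the defaults declared in the schema.
    params = {
        name: spec["default"]
        for name, spec in properties.items()
        if "default" in spec
    }
    params.update(user_values)

    missing = [name for name in schema.get("required", []) if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    for name, value in params.items():
        if name not in properties:
            raise ValueError(f"unknown parameter: {name}")
        expected = JSON_TYPES[properties[name]["type"]]
        if not isinstance(value, expected):
            raise TypeError(f"{name} should be of type {properties[name]['type']}")
    return params


launch_params = build_params(SCHEMA, {"input": "samples.csv", "max_cpus": 8})
print(launch_params)
```

Keeping the parameter description in a schema means the same file could feed a command-line prompt, a web form, or a validation step; a real implementation would more likely rely on a full JSON Schema validator than on the hand-written checks shown here.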

I’ll also point to future developments coming soon, such as migration to Nextflow DSL2 with modular code and improved code testing: both unit testing and full-scale data tests.

Speakers
avatar for Phil Ewels

Phil Ewels

Bioinformatics Lead, Science for Life Laboratory
Bioinformatician doing research into next-generation sequencing applications. Lead for Bioinformatics development at the National Genomics Infrastructure in Sweden, part of SciLifeLab. Projects: MultiQC, nf-core, SRA-Explorer, QC Fail, Cluster Flow.



Tuesday July 21, 2020 10:15 - 10:20 EDT
BOSC

10:20 EDT

H3AGWAS: Portable GWAS workflows for African science 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Jean-Tristan Brandenburg 1, Shaun Aron 1, Shakuntala Baichoo 2, Gerrit Botha 3,
Christopher J Fields 4, Sumir Panji 3, Nicola Mulder 3, Scott Hazelhurst 1,5

1 Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand,
Johannesburg, South Africa.
2 University of Mauritius, Mauritius.
3 Computational Biology Division, University of Cape Town, South Africa.
4 High Performance Computing in Biology, University of Illinois at Urbana-Champaign, United States.
5 School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, South Africa

Project Website: https://h3abionet.org/
Source Code: https://github.com/h3abionet/h3agwas
Licence: MIT Licence

Genome Wide Association Studies (GWAS) require several complex analyses of large data sets
over many iterations. Many projects in the Human Heredity and Health in Africa (H3A)
consortium are conducting GWAS. As part of its mandate to support H3A, the Pan-African
Bioinformatics Network for H3Africa (H3ABioNet), comprising 27 research groups in 17
countries, is building scientific workflows for African scientists [Baichoo et al 2018, doi:
10.1186/s12859-018-2446-1]. Many African research groups work in resource-constrained
environments with over-burdened bioinformaticists and system administrators compared to
better resourced groups elsewhere. These African research groups need access to
heterogeneous computing resources and the ability to quickly deploy consistent, powerful, and
reproducible workflows to different environments (cloud/local/remote with or without
containers, scheduling systems) without spending hours or days setting up the environments.

In response, we built the H3AGWAS workflows to support various GWAS phases. The pipeline is
highly scalable and portable. Built using the Nextflow language and containers (Docker and
Singularity supported), it can run in different environments (single computer, SLURM, PBS,
Amazon Batch). H3AGWAS provides powerful, flexible and parameterisable workflows
supporting quality control, data format conversion, association testing using several GWAS
tools and techniques (PLINK, GEMMA, BOLT-LMM), support of genotype and imputed data,
post-association analysis and annotation, meta-analysis, heritability estimation, and fine mapping.
The key workflows provide comprehensive reports to the user.

The workflow has been used by two H3A groups (AWI-Gen & Trypanogen) and other scientists
inside and outside of Africa. Although primarily designed for production work, the flexibility
and portability of the workflows have made them useful for training.

Since its original publication, significant functionality in association testing (e.g. BOLT-LMM),
post-association analysis and support for new computing environments has been added.
H3AGWAS is undergoing active development. We plan to add genotype calling from image data.
H3ABioNet is also actively building other workflows to support African science for training and
production work.

Acknowledgements: H3ABioNet is funded by a NIH Common Fund Award / NHGRI Grant
Number U24HG006941. We thank Lerato Magosi, Rob Clucas, Eugene de Beste, Aboyomini
Mosaku, Don Armstrong, Ananyo Choudhury, Ayton Meintjes, Michele Ramsay and
Dhriti Sengupta for their support.

Speakers
BJ

Brandenburg Jean-Tristan

Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand.



Tuesday July 21, 2020 10:20 - 10:25 EDT
BOSC

10:20 EDT

VINYL: Variant prIoritizatioN bY survivaL analysis 🌀
➞ Abstract

Pietro Mandreoli 1,2, Marco Antonio Tangaro 2, David S. Horner 1,2, Federico Zambelli 1,2, Graziano Pesole 2,3, Matteo Chiara 1,2

  1. Department of Biosciences, University of Milan, via Celoria 26, 20133 Milano, Italy
  2. Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126 Bari, Italy
  3. Department of Biosciences, Biotechnologies and Biopharmaceutics, University of Bari, Via Orabona 4, 70126 Bari, Italy

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
MC

Matteo Chiara

Department of Biosciences, University of Milan



Tuesday July 21, 2020 10:20 - 10:25 EDT
Galaxy

10:25 EDT

Evaluating customized database generation methods for metaproteomics analysis within the Galaxy platform 🌀
➞ Abstract

Subina Mehta 1, Thomas McGowan 1, Francesco Delogu 2, James E Johnson 1, Praveen Kumar 1, Magnus Arntzen 2, Marie Crane 1, Peter S. Thuy-Boun 3, Dennis W Wolan 3, Timothy J Griffin 1, Pratik D Jagtap 1

  1. University of Minnesota, Minneapolis, MN, USA.
  2. Norwegian University of Life Sciences, Ås, Norway.
  3. Scripps Research, La Jolla, CA, USA

The presenter(s) will be available for live Q&A in both BCC West and BCC East.

Speakers
avatar for Subina Mehta

Subina Mehta

Researcher, University of Minnesota



Tuesday July 21, 2020 10:25 - 10:30 EDT
Galaxy

10:25 EDT

Q&A 🍐
The presenter(s) will be available for live Q&A in this session.

Moderators
avatar for Peter Cock

Peter Cock

The James Hutton Institute
Bioinformatician at the James Hutton Institute, a member of the BOSC organizing committee, treasurer of the Open Bioinformatics Foundation, and a core developer on the Biopython project.

Tuesday July 21, 2020 10:25 - 10:30 EDT
BOSC

10:30 EDT

An automated, accessible proteogenomic pipeline for high confidence detection and rigorous validation of novel peptide sequence variants in Galaxy-P 🌀
➞ Abstract

Andrew Rajczewski 1, Bo Wen 2, James Johnson 1, Subina Mehta 1, Praveen Kumar 1, Ray Sajulga 1, Qiyuan Han 1, Pratik Jagtap 1, Bing Zhang 2, Natalia Tretyakova 1, Timothy Griffin 1

  1. University of Minnesota, Minneapolis MN.
  2. Baylor College of Medicine, Houston TX.

The presenter(s) will be available for live Q&A in both BCC West and BCC East.

Speakers
AT

Andrew T. Rajczewski

University of Minnesota



Tuesday July 21, 2020 10:30 - 10:35 EDT
Galaxy

10:30 EDT

CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Michael Kotliar 1*, Andrey V. Kartashov 1, Artem Barski 1,2

1 Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center and Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, USA and
2 Division of Human Genetics, Cincinnati Children’s Hospital Medical Center and Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, USA
*michael.kotliar@cchmc.org

Project Website: https://barski-lab.github.io/cwl-airflow/ 
Source Code: https://github.com/Barski-lab/cwl-airflow
License: Apache License 2.0

Modern biomedical research has seen a remarkable increase in the production and computational analysis of large datasets, leading to an urgent need to share standardized analytical techniques. However, of the >100 computational workflow systems used in research, most define their own specifications for computational pipelines. The Common Workflow Language (CWL) working group was formed to create a language for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments. Herein, we present CWL-Airflow, a package that adds support for CWL to the Apache Airflow pipeline manager. The addition of CWL capability to Airflow has made it more convenient for scientific computing, in which users are more interested in the flow of data than in the tasks being executed. While Airflow defines workflows only as sequences of steps to be executed (i.e., DAGs), the CWL description of inputs and outputs leads to a better representation of data flow. This allows for a better understanding of data dependencies and produces more readable workflows.

After CWL-Airflow was published in 2019, we introduced major changes in the architecture of the program, making it more suitable for large-scale data processing. The original approach of creating a separate CWLDAG-class instance on each new run was replaced by a more efficient one: triggering the same workflow with updated input parameters through the API server. Additionally, we added the Workflow Execution Service (WES) API as a standardized way to programmatically manage the workflow execution process. In order to run a CWL pipeline in Airflow, our package loads the CWL workflow descriptor file and creates a CWLDAG-class instance that reflects the CWL workflow structure. Workflow step execution order is based on step inputs and outputs, thereby implementing the dataflow principles and architecture that are missing in Airflow. For computationally intensive pipelines, Airflow can use the Celery task queue to distribute processing over multiple nodes. The Celery system helps not only to balance the load over the different machines but also to define task priorities by assigning them to separate queues.
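
The idea that step execution order follows from declared inputs and outputs can be illustrated with a minimal Python sketch. It is not CWL-Airflow's code: the step names, the layout of the `steps` dictionary, and the `execution_order` helper are hypothetical, and only the standard-library `graphlib` module (Python 3.9+) is used.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step lists the data items it reads and writes.
steps = {
    "call_peaks": {"inputs": ["sorted_bam"], "outputs": ["peaks"]},
    "coverage": {"inputs": ["sorted_bam"], "outputs": ["bigwig"]},
    "sort": {"inputs": ["bam"], "outputs": ["sorted_bam"]},
    "align": {"inputs": ["reads", "genome_index"], "outputs": ["bam"]},
}


def execution_order(steps):
    """Return step names ordered so that producers run before consumers."""
    # Map each data item to the step that produces it.
    producer = {
        output: name
        for name, spec in steps.items()
        for output in spec["outputs"]
    }
    # A step depends on the steps that produce its inputs. Inputs without a
    # producer (such as "reads") are treated as workflow-level inputs.
    graph = {
        name: {producer[item] for item in spec["inputs"] if item in producer}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

The steps are deliberately listed out of order: the result depends only on the data dependencies, so `align` always precedes `sort`, which precedes both `call_peaks` and `coverage`. The relative order of the last two is not fixed, because neither depends on the other.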

Since the key promise of the CWL specification is the portability of analyses and their reproducibility, the CWL-Airflow team took part in the Global Alliance for Genomics and Health (GA4GH) Workflow Execution Challenge both as a workflow author and as a participant. The results showed that CWL-Airflow complies with the CWL specification, supports portability, and performs analysis in a reproducible manner. CWL-Airflow leverages all the benefits provided by Airflow, such as scaling and multi-platform support, a web-based GUI, workflow execution pools and queues, and simple installation and configuration. In summary, CWL-Airflow complies with the CWL v1.1 specification and will provide users with the ability to execute CWL workflows anywhere Airflow can run, from a laptop to a cluster or cloud environment.

Speakers
avatar for Michael Kotliar

Michael Kotliar

Cincinnati Children's Hospital Medical Center



Tuesday July 21, 2020 10:30 - 10:45 EDT
BOSC

10:35 EDT

Q & A 🌀
Question and Answer session for the just finished talks.

Moderators
avatar for Yvan Le Bras

Yvan Le Bras

Research engineer, French National Museum of Natural History

Tuesday July 21, 2020 10:35 - 10:40 EDT
Galaxy

10:40 EDT

Galaxy enables FAIR mass spectrometry imaging data analysis of urothelial carcinoma tissues 🌀
➞ Abstract

Melanie Christine Föll 1, Veronika Volkmann 1, Kathrin-Enderle Ammour 1, Peter Bronsert 1, Oliver Schilling 1

  1. Institute for Surgical Pathology, Faculty of Medicine, University of Freiburg, Freiburg, Germany

The presenter(s) will be available for live Q&A at the end of this session (BCC West)

Speakers
avatar for Melanie Föll

Melanie Föll

PostDoc, Northeastern University Boston



Tuesday July 21, 2020 10:40 - 10:45 EDT
Galaxy

10:45 EDT

Challenges in implementing Janis: A generator for CWL and WDL pipelines 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC East).

Michael Franklin 1,2, Richard Lupat 2, Daniel Park 1, Bernard Pope 1, Evan Thomas 3, Jiaan Yu 2, Mohammad Bhuyan 3, Tony Papenfuss 3, Jason Li 2

1 University of Melbourne, Melbourne. Email: michael.franklin@unimelb.edu.au
2 Peter MacCallum Cancer Centre, Melbourne.
3 Walter and Eliza Hall Institute of Medical Research.

Project Website: https://janis.readthedocs.io 
Source Code: https://github.com/PMCC-BioinformaticsCore/janis 
License: GPL-3.0


Main Text of Abstract

There are many frameworks for building bioinformatics pipelines, including the Common Workflow Language (CWL), Workflow Description Language (WDL), Nextflow and more. Each framework brings a community and a set of resources, including engines and other tools. The incompatibility of these frameworks poses considerable challenges for portability, where changing between systems requires substantial re-engineering efforts and is an inhibitor to sharing workflows. There are many external differences between these frameworks, such as implementation language; however, they overlap considerably in their underlying features.

Janis is an open-source Python framework that addresses this interoperability problem by abstracting the workflow model in order to generate CWL and WDL pipelines. The Janis API simplifies many aspects of building workflows and can mask idiosyncrasies of the target specifications while still allowing for rich workflow logic to be represented. The ability to target multiple workflow specifications unlocks tools from their respective communities and mitigates the risks and effects of pipeline frameworks becoming unsupported.
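
The general idea of describing a tool once and rendering it to more than one target can be sketched in a few lines of Python. This is not the Janis API: the `Tool` dataclass and the `to_cwl` and `to_wdl` functions are hypothetical names invented for illustration, and the generated text is deliberately simplified, so it is not guaranteed to validate against either specification.

```python
from dataclasses import dataclass, field

# CWL spells the string type in lower case, while WDL uses "String".
CWL_TYPES = {"File": "File", "String": "string"}


@dataclass
class Tool:
    """Hypothetical abstract description of a command-line tool."""

    name: str
    base_command: str
    inputs: dict = field(default_factory=dict)  # input name -> "File" or "String"


def to_cwl(tool):
    """Render the tool as simplified CWL-style YAML text."""
    lines = [
        "cwlVersion: v1.1",
        "class: CommandLineTool",
        f"id: {tool.name}",
        f"baseCommand: {tool.base_command}",
        "inputs:",
    ]
    for position, (name, type_) in enumerate(tool.inputs.items(), start=1):
        lines.append(f"  {name}:")
        lines.append(f"    type: {CWL_TYPES[type_]}")
        lines.append(f"    inputBinding: {{position: {position}}}")
    lines.append("outputs: []")
    return "\n".join(lines)


def to_wdl(tool):
    """Render the same tool as a simplified WDL-style task."""
    lines = ["version 1.0", f"task {tool.name} {{", "  input {"]
    for name, type_ in tool.inputs.items():
        lines.append(f"    {type_} {name}")
    lines.append("  }")
    # Each input is passed to the command as a positional placeholder.
    arguments = " ".join(f"~{{{name}}}" for name in tool.inputs)
    lines.append("  command <<<")
    lines.append(f"    {tool.base_command} {arguments}")
    lines.append("  >>>")
    lines.append("}")
    return "\n".join(lines)


tool = Tool(name="echo_message", base_command="echo", inputs={"message": "String"})
cwl_text = to_cwl(tool)
wdl_text = to_wdl(tool)
print(cwl_text)
print(wdl_text)
```

Even this toy version has to absorb a target-specific difference (the spelling of the string type) inside the generators so that the tool description stays the same for both outputs.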

In this talk we will discuss the challenges that we faced when abstracting the workflow model for CWL and WDL. In particular, we will cover standardizing a library of functions, supporting conditional steps, and porting functionality such as secondary files into WDL. We’ll demonstrate how an abstraction can provide useful benefits to workflow authors and the community.

Since our previous presentation of Janis at BOSC in 2019, the API has been restructured to be simpler to use and the feature set has been expanded to represent more complex logic. These changes allowed us to increase the number of researchers using Janis for their analyses. We consider the abstraction that Janis provides to be a powerful way to build workflows, allowing users to take advantage of the substantial collection of community tools across a wide variety of workflow technologies.

Speakers
avatar for Michael Franklin

Michael Franklin

Research Software Engineer, University of Melbourne
I'm a research software engineer at the University of Melbourne / Peter MacCallum Cancer Centre who's interested in all things pipelines! I develop Janis, a workflow assistant that generates CWL and WDL. I'm also interested in general programming, specifically web (React) and database-y...



Tuesday July 21, 2020 10:45 - 10:50 EDT
BOSC

10:45 EDT

Democratizing DIA analysis on public cloud infrastructures via Galaxy 🌀
➞ Abstract

Matthias Fahrner 1, Melanie Christine Föll 1, Björn Andreas Grüning 2, Matthias Bernt 3, Hannes Röst 4, Oliver Schilling 1

  1. Institute for Surgical Pathology, Faculty of Medicine, University of Freiburg, Freiburg, Germany * E-mail: matthias.fahrner@uniklinik-freiburg.de
  2. Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
  3. Helmholtz Centre for Environmental Research -- UFZ, Young Investigators Group Bioinformatics and Transcriptomics, Leipzig, Germany
  4. Department of Molecular Genetics, University of Toronto, Toronto, Canada

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
avatar for Matthias Fahrner

Matthias Fahrner

PhD student, Institute for Surgical Pathology, Medical Center – University of Freiburg



Tuesday July 21, 2020 10:45 - 10:50 EDT
Galaxy

10:50 EDT

EOSC-Nordic Climate Science Workbench roadmap 🌀
➞ Abstract

Anne Claire Fouilloux 1, Adil Hasan 2, Ari Lukkarinen 3, Hamish Struthers 4

  1. Department of Geosciences, University of Oslo, Norway. Email: annefou@geo.uio.no
  2. UNINETT Sigma2, Norway.
  3. CSC, Finland.
  4. SNIC, Sweden.

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
avatar for Anne Fouilloux

Anne Fouilloux

Research Software Engineer, University of Oslo
I am working on Galaxy Climate (development of tools, integration of climate data, training material).



Tuesday July 21, 2020 10:50 - 10:55 EDT
Galaxy

10:50 EDT

Rapidly creating portable pipelines with aCLImatise 🍐
→ Abstract, Video


The presenter(s) will be available for live Q&A in this session (BCC East).

Michael Milton 1,2, Natalie Thorne 1, 2, 3, 4, Bioinformatics Working Group 1

1 Melbourne Genomics Health Alliance,
2 Walter and Eliza Hall Institute,
3 Murdoch Children's Research Institute,
4 The University of Melbourne

Project Website: https://github.com/aclimatise
Source Code: https://github.com/aclimatise
License: GPLv3



Speakers
MM

Michael Milton

Melbourne Genomics



Tuesday July 21, 2020 10:50 - 10:55 EDT
BOSC

10:55 EDT

Functionally Assembled Terrestrial Ecosystem Simulator (FATES) with Community Land Model in Galaxy 🌀
➞ Abstract

Anne Fouilloux 1, Hui Tang 2,3, Eva Lieungh 3, Sonya R. Geange 4, Peter Horvath 3, Anders Bryn 3

  1. Department of Geosciences, University of Oslo, Norway.
  2. Department of Geosciences, University of Oslo, Norway.
  3. Natural History Museum, University of Oslo, Norway.
  4. Department of Biological Sciences, University of Bergen, Norway.

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
avatar for Anne Fouilloux

Anne Fouilloux

Research Software Engineer, University of Oslo
I am working on Galaxy Climate (development of tools, integration of climate data, training material).
HT

Hui Tang

University of Oslo, Department of Geosciences
SG

Sonya Geange

Department of Biological Sciences, University of Bergen, Norway



Tuesday July 21, 2020 10:55 - 11:10 EDT
Galaxy

10:55 EDT

The all-new Genomics Virtual Lab 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).


Since 2012, the Genomics Virtual Lab (GVL) has provided a platform for deploying a production-grade Galaxy in the cloud. More than 25,000 GVL instances have been launched since then, many for training events that require dynamically scalable infrastructure. During this period, it became evident that managed Galaxy services, such as the usegalaxy.* federation, are preferred by users for simplicity of access. However, the overhead of maintaining many such instances is taxing for system administrators and repetitive across the Galaxy community. Managed instances are also more challenging to customize to users’ requirements and require ongoing, active maintenance. In response, over the past two years, we have developed an all-new version of the GVL that is intended to be used as a platform for deploying Galaxy. With the GVL v5, every installation of Galaxy is a production installation with complete functionality out of the box. The installation includes a horizontally scalable installation of Galaxy, built-in monitoring capabilities with Grafana, centralized authentication with Keycloak, web-based terminal access, and Jupyter. The GVL also comes with CloudMan as a graphical manager for Galaxy and, in the case of cloud deployments, for the cloud infrastructure as well. The platform has been tested on a number of Galaxy Training Network tutorials and is ready for use. GVL 5 is based on software containers and container orchestration tools such as Docker, Kubernetes, and Helm. This makes it portable across systems while promoting replicability and uniformity. Thus far, the GVL has been made available for automated deployment on 4 clouds. Additionally, it is possible to deploy the GVL on local resources through a Helm chart. In this talk, we will present the motivation, currently available features, and near-future plans for the GVL platform, such as multiple projects and shared storage.

Speakers
avatar for The Other Enis Afgan

The Other Enis Afgan

Research scientist, Johns Hopkins University



Tuesday July 21, 2020 10:55 - 11:10 EDT
BOSC

11:10 EDT

Q & A 🌀
Question and Answer session for the just finished talks.

Moderators
avatar for Yvan Le Bras

Yvan Le Bras

Research engineer, French National Museum of Natural History

Tuesday July 21, 2020 11:10 - 11:15 EDT
Galaxy

11:10 EDT

Q&A 🍐
The presenter(s) will be available for live Q&A in this session.

Moderators
avatar for Peter Cock

Peter Cock

The James Hutton Institute
Bioinformatician at the James Hutton Institute, a member of the BOSC organizing committee, treasurer of the Open Bioinformatics Foundation, and a core developer on the Biopython project.

Tuesday July 21, 2020 11:10 - 11:15 EDT
BOSC

11:30 EDT

GigaScience Sponsor Table
GigaScience is an online open access, open data, open peer-review journal published by Oxford University Press and BGI. The journal offers ‘big data’ research from the life and biomedical sciences, and on top of 'Omics research includes the growing range of work that uses difficult-to-access large-scale data, such as imaging, neuroscience, ecology, systems biology, and other new types of shareable data. GigaScience is unique in the publishing industry as it publishes all research objects (data, software tools, source code, workflows, containers and other elements related to the work underpinning the findings in the article). Promoting Open Science, all published software needs to be under an OSI-license, all supporting data must be available and open, and all peer review is carried out transparently. Presenting workflows via our GigaGalaxy.net server, novel work presented at the meeting utilising Galaxy is eligible for a 15% APC if it is submitted to our Galaxy series.

Please stop by and learn more about GigaScience. We are located on the first floor of the Poster / Demo building.

Speakers
avatar for Ken Cho

Ken Cho

Systems Programmer Analyst, GigaScience
avatar for Scott Edmunds

Scott Edmunds

Editor in Chief, GigaScience Press/BGI Hong Kong
Scott Edmunds is the Editor in Chief of GigaScience Press. With over 15 years experience in Open Access and Open Data publishing he is co-founder of CivicSight (formerly Open Data Hong Kong) and CitizenScience.Asia, and is on the Board of Directors of the Dryad Digital Repository...
avatar for Laurie Goodman

Laurie Goodman

Publishing Director, GigaScience Press
Laurie Goodman, PhD, is the Publishing Director for GigaScience Press, which publishes the international, open-science journals GigaScience and GigaByte. Both journals have won awards for Innovation in publishing. Dr. Goodman received her BS and MS from Stanford University in 1986...



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P1-02: A data model approach to data coordination in the Human Tumor Atlas Network project 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P2-03: COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Andrea Guarracino

Andrea Guarracino

PhD student, Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata, Rome, Italy.


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P2-06: CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Michael Kotliar

Michael Kotliar

Cincinnati Children's Hospital Medical Center


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P2-09: Democratizing DIA analysis on public cloud infrastructures via Galaxy 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Matthias Fahrner

Matthias Fahrner

PhD student, Institute for Surgical Pathology, Medical Center – University of Freiburg



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P2-11: Development of TranSMART and Applications in Biomedical Research 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
PR

Peter Rice

Oryza Bioinformatics Ltd.


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P2-12: Dnpatterntools suite for nucleosome positioning sequence patterns 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Erinija Pranckeviciene

Erinija Pranckeviciene

Assoc. Prof., Vilnius University



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P3-05: EOSC-Nordic Climate Science Workbench roadmap 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Anne Fouilloux

Anne Fouilloux

Research Software Engineer, University of Oslo
I am working on Galaxy Climate (development of tools, integration of climate data, training material).



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P3-07: Evaluating customized database generation methods for metaproteomics analysis within the Galaxy platform 🌀
➞ Abstract, Poster

This poster will be presented live at BCC West.

Speakers
avatar for Subina Mehta

Subina Mehta

Researcher, University of Minnesota



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P3-08: Evolution of the Nextflow workflow management system 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Evan Floden

Evan Floden

Seqera Labs


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P4-02: Galaxy enables FAIR mass spectrometry imaging data analysis of urothelial carcinoma tissues 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Melanie Föll

Melanie Föll

PostDoc, Northeastern University Boston



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P4-08: H3AGWAS: Portable GWAS workflows for African science 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
BJ

Brandenburg Jean-Tristan

Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand.



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P4-12: In silico characterization of FK506-binding proteins in wheat 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P5-02: Introducing App Store for IGB, a site for sharing and installing extensions for Integrated Genome Browser from BioViz.org 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
avatar for Ann Loraine

Ann Loraine

Professor, UNC Charlotte
I develop Integrated Genome Browser that integrates with Galaxy using the external viewer API. I'm interested in building more connections between visualization tools like IGB and Galaxy using APIs.


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P5-05: Nebulizer: a command-line utility for remote Galaxy admin tasks 🌀
➞ Abstract

Nebulizer is a Python utility which provides a high-level interactive command line interface to remotely administer Galaxy servers.

Nebulizer enables admin operations to be performed efficiently via the command line, as an alternative to using a Galaxy instance's web interface. It was developed to help "part-time" admins perform day-to-day administrative tasks (specifically management of users, tools and data libraries) across multiple Galaxy instances.

The poster gives a brief overview of Nebulizer's main functionality; additionally there is a pre-recorded demo (https://youtu.be/_eXKTSSYBgY) which covers initial installation and set-up, and shows how it can be used to manage users, create and populate data libraries, and to install, update and remove tools.

The poster will be presented live at BCC West on Tuesday July 21, 2020 16:30 - 17:15 (Europe/London timezone), and will include a live demo of the utility lasting approximately 30 minutes.

Speakers
avatar for Peter Briggs

Peter Briggs

Computer Officer, Bioinformatics Core Facility, University of Manchester
Software developer supporting a team of informaticians within the Bioinformatics Core Facility (BCF) in the Faculty of Biology Medicine & Health (FBMH) at the University of Manchester. Responsibilities include supporting a number of local Galaxy instances both internally (for local...



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P5-06: New features for GRNsight: a web application for visualizing models of small- to medium-scale gene regulatory networks 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
Kam D. Dahlquist

Loyola Marymount University


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P5-11: Producing biodiversity indicators from citizen science projects: update of birds and bats monitoring schemes on Galaxy-E 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers
Coline Royaux

French national museum of natural history (MNHN)



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P6-05: SIMD Everywhere: portable implementations of SIMD intrinsics 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers

Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P6-07: Survey of metaproteomics software tools for functional microbiome analysis 🌀
➞ Abstract

This poster will be presented live at BCC West.

Speakers


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P6-09: Testing and investigating deep learning models for promoter recognition 🍐
This poster will be presented live at BCC West: https://deskle.com/dfGMxtY

A video is also provided: https://www.youtube.com/watch?v=_9r8D394T40

➞ Abstract

Understanding DNA sequences has been an ongoing endeavour within bioinformatics research. Recognizing the functionality of DNA sequences is a non-trivial and complex task that can bring insights into understanding DNA. This project explored deep learning models for recognizing gene-regulating regions of DNA, specifically promoters. We implemented current models from the literature to replicate their results and to explore how the models might be recognizing promoters. Publications in this field typically provide a web application for the community to use, where one can submit limited data to obtain the model’s result. This has become the standard in the field, making it rare for authors to provide the source code of their work, which can create unnecessary obstacles for new research in the field.

 Previous work has also focused on limited curated datasets to both train and evaluate their models using cross-validation, obtaining high-performing results across a variety of metrics. We implemented various models from the literature and compared them against each other, using their datasets interchangeably throughout the comparison tests. These comparisons highlight shortcomings within the training and testing datasets for these models, prompting us to create a robust promoter recognition testing dataset and develop a testing methodology that creates a wide variety of testing datasets for promoter recognition. 

It is then possible to test and analyse the models from the literature with the newly created datasets, which provide a standard benchmark that mimics a realistic scenario. To avoid replicability and model comparability issues in the future, we open-source our findings and testing methodology. New deep learning (DL) models can be implemented as PyTorch modules, while other machine learning (ML) models can be implemented using sklearn. Both can be trained and tested using sklearn’s procedures, where DL models can make use of skorch as a wrapper around PyTorch with an added sklearn interface. Training and testing scripts for DL models have been added as examples and can be expanded by the open source community. While we focus on DL models in this project, our training and testing scripts are also applicable to other ML models.



Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P6-11: The Xena Geneset Viewer provides a visual comparison of pathways over public genomic cancer data 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
Nathan Dunn

Software Developer, Lawrence Berkeley National Lab


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P7-02: ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes 🍐
➞ Abstract

This poster will be presented live at BCC West.

Speakers
Niema Moshiri

Assistant Teaching Professor, University of California, San Diego


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

P7-04: What’s new with nf-core: community-built bioinformatics pipelines 🍐
➞ Abstract

This poster will be presented live at BCC East and BCC West.

Speakers
Phil Ewels

Bioinformatics Lead, Science for Life Laboratory
Bioinformatician doing research into next-generation sequencing applications. Lead for Bioinformatics development at the National Genomics Infrastructure in Sweden, part of SciLifeLab. Projects: MultiQC, nf-core, SRA-Explorer, QC Fail, Cluster Flow.


Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

11:30 EDT

Poster / Demo West Session 3
The second poster and demo session of BCC2020.

Access the Poster / Demo hall through the "Go to Posters" button at the top left in the main BCC2020 Remo conference space.

Tuesday July 21, 2020 11:30 - 12:15 EDT
Joint

12:30 EDT

Joint West Session 6: COVID-19
Accepted talks and lightning talks. A joint session with the BOSC and Galaxy communities.

Moderators
Tuesday July 21, 2020 12:30 - 13:45 EDT
Joint

12:31 EDT

Serratus: Ultra-deep search for novel coronaviruses 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Rayan Chikhi 1, Kyl Wellman, Steven J. Hallam 2, Anton Korobeynikov 3, Dan Lohr, Robert
C. Edgar, Artem Babaian, Dmitry Meloshko 3, Tomer Altman, Ryan J. McLaughlin 2, Jeff
Taylor, Victor Lin, and Gherman Novakovsky 2.

# Author list randomized
1 Institut Pasteur & CNRS
2 University of British Columbia
3 Center for Algorithmic Biotechnologies, Saint Petersburg State University.

Contact: artem@rRNA.ca
Project Website: http://serratus.io
Source Code: https://github.com/ababaian/serratus
License: GPLv3


Abstract

Despite intense efforts to sequence and analyze SARS-CoV-2 isolates, understanding
of the virus's provenance is limited by incomplete genomic characterization of the
Coronaviridae (CoV) family.

Serratus is an open science project for the discovery of new virus sequences by
aligning all RNA-seq, meta-genomic, meta-transcriptomic and environmental NGS data
in the NCBI Short Read Archive (SRA).

Here we report a preliminary survey of 1.14 million sequence libraries (26.78
petabases) where we have uncovered several previously unreported CoV species, and
identified thousands of CoV+ libraries.

To perform this ultra-high throughput CoV search, we leveraged AWS cloud HPC with
a 22,500 vCPU cluster. Using a hyper-parallelized architecture we could bypass
conventional networking and disk IO bottlenecks to achieve a processing rate in
excess of 500,000 sequencing libraries per day, at a cost of ~$0.01 per library.
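As a rough back-of-the-envelope check of the figures quoted above, the following sketch multiplies them out. The inputs are the approximate values from the abstract ("in excess of", "~"), the variable names are illustrative only, and the results should be read as order-of-magnitude estimates rather than reported measurements.

```python
# Rough estimate from the approximate figures quoted in the abstract.
# Variable names are illustrative; none of these values are exact.
libraries = 1_140_000          # sequence libraries in the preliminary survey
libraries_per_day = 500_000    # quoted lower bound on the processing rate
cost_per_library_usd = 0.01    # quoted approximate cost per library

# The rate is a lower bound, so this is an upper bound on wall-clock time.
days_needed = libraries / libraries_per_day
total_cost_usd = libraries * cost_per_library_usd

print(f"processing time: at most about {days_needed:.2f} days")
print(f"compute cost: roughly ${total_cost_usd:,.0f}")
```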

We are building a 100% open data-set of all viral sequences in the SRA to accelerate
the translation of these data. All notebooks, source-code, raw and processed
sequence data generated in Serratus are freely available within 24h of discovery.

Expanding the known repertoire of coronaviruses will not only help determine
the origins of this pandemic, but it can help prevent another one.

Speakers
Artem Babaian

University of British Columbia



Tuesday July 21, 2020 12:31 - 12:45 EDT
Joint

12:45 EDT

MALVIRUS: viral variant calling made easy 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

S. Ciccolella*, L. Denti*, P. Bonizzoni, G. Della Vedova, Y. Pirola**, M. Previtali**

Dept. of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy.

E-mail: yuri.pirola@unimib.it
*Joint First Authors.
**Joint Last Authors.

Project Website: https://algolab.github.io/MALVIRUS/
Source Code: https://github.com/algolab/malvirus
License: GNU General Public License version 3

The SARS-CoV-2 pandemic has put global health care services to the test, and many researchers are racing to confront its rapid spread. Efficient approaches for analyzing variations in the growing amount of sequencing data produced daily are of the utmost importance.

We introduce MALVIRUS, an easy-to-install and easy-to-use web application that assists users in computing a SNP catalog extracted from the sequences of a viral population and in efficiently calling the variants of the catalog that are present in a read sample. MALVIRUS implements a pipeline divided into two modules, based on four state-of-the-art open source tools. The first module uses MAFFT and snp-sites to compute the SNP catalog from the input set of sequences, whereas the second module uses KMC3 and MALVA to call the genotypes from the input read sample. MALVIRUS is designed to work with viral populations and high-coverage viral read samples. Tests on Illumina and Nanopore samples sequenced from SARS-CoV-2 strains demonstrate the efficiency and effectiveness of MALVIRUS in genotyping viral strains with respect to the SNP catalog extracted from GISAID data.
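To illustrate what a "SNP catalog" computed from aligned sequences means, here is a minimal toy sketch. It is not the MALVIRUS, MAFFT, or snp-sites implementation: the function name `snp_columns` and the example alignment are hypothetical, and the input is assumed to be already aligned.

```python
def snp_columns(aligned):
    """Return {position: sorted alleles} for alignment columns with more than one base.

    `aligned` is a list of equal-length, already aligned sequences.
    Gaps ('-') and ambiguous bases ('N') are ignored in this toy version.
    Positions are 0-based.
    """
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    catalog = {}
    for pos, column in enumerate(zip(*aligned)):
        alleles = {base.upper() for base in column} - {"-", "N"}
        if len(alleles) > 1:
            catalog[pos] = sorted(alleles)
    return catalog


# Hypothetical three-sequence alignment with variable sites at positions 2 and 4.
toy_alignment = ["ACGTAC", "ACCTAC", "ACGTTC"]
print(snp_columns(toy_alignment))
```

In the pipeline described above, a catalog of this kind is what the second module then genotypes against a read sample.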

MALVIRUS is released under the GPL3 license and is available as a self-hosted web application. It is distributed as a Docker image and uses open source platforms as its backbone, such as Snakemake for pipeline execution and Bioconda for package management. These technologies will enable us to scale MALVIRUS to public clouds or computing infrastructures and offer it as a public service. The web interface is composed of a Flask backend and a React JS frontend. The entire application can be easily installed using "docker run -p 56733:80 -v mvjobs:/jobs algolab/malvirus". Then, once the Docker container is running, MALVIRUS is easily accessible through your preferred web browser at http://localhost:56733/.

The computed genotypes can be viewed in the web interface or downloaded as VCF or ODS files for further processing. Extensive documentation for the entire process, from installation to examination of the results, is available at https://algolab.github.io/MALVIRUS/ together with a detailed tutorial.

Speakers
Yuri Pirola

DISCo, Univ. degli Studi di Milano-Bicocca
Interested in designing algorithms and developing useful tools.



Tuesday July 21, 2020 12:45 - 12:50 EDT
Joint

12:50 EDT

Development of TranSMART and Applications in Biomedical Research 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Peter Rice 1, Kendra Elliston 2, Keith Elliston 2, Rudy Potenzone 3

1 Oryza Bioinformatics Ltd, Royston, UK. Email: ricepeterm@yahoo.co.uk
2 Axiomedix Inc., Bedford, MA, USA.
3 I2b2 tranSMART Foundation, Wakefield MA, USA.

Project Website: https://wiki.transmartfoundation.org/
Source Code: https://github.com/transmart-foundation/transmart/
License: GPLv3
Language: Grails, R, SQL, Groovy, Kettle


TranSMART is a suite of data exploration, visualization, and ETL tools, which were developed by
pharma for translational research studies based on the i2b2 project www.i2b2.org. Clinical and
omics data from clinical trials can be analyzed through a web client or exported for local analysis.
Over 200 curated studies from GEO, TCGA and other sources are available for public download.

When an open source version was made available in 2012, several major projects (especially
eTRIKS https://www.etriks.org/ and TraIT https://trait.health-ri.nl/ in Europe) added support for
PostgreSQL in addition to the original Oracle database, increased the available datatypes, and added
many new plugin features. The latest version 19.0 includes a full code review, upgrades the grails
and Postgres versions and other internal libraries, accelerates data loading, and supports the
standard i2b2 data model.

The i2b2 code from which tranSMART arose is now also fully open source. We have new funding to
support development in both tranSMART (clinical and 'omics data) and i2b2 (clinical data for all
patients) so that they share a common database platform.

TranSMART continues to be developed and maintained by a community effort, coordinated by the
i2b2 tranSMART Foundation and is supported by Axiomedix Inc with a publicly available
knowledgebase built in Zendesk https://transmart.support.axiomedix.com/.

We have several current projects applying tranSMART in biomedical research:

In a project funded by Dell Technologies we are curating 80+ coronavirus studies
(SARS-CoV-2, SARS-CoV and MERS-CoV) from the Gene Expression Omnibus and other sources for
a public tranSMART 19 server to be available from June 2020. The datasets will also be
available for download into any tranSMART instance. We expect to expand the analysis
repertoire of tranSMART to meet the urgent needs of COVID-19 research.

We have a public TB Data Commons site https://transmart-ospf.axiomedix.net/transmart/
with the Open Source Pharma Foundation http://www.ospfound.org/ loaded with data for
Tuberculosis and emerging tropical viruses plus Metformin studies relevant to TB research.

Speakers
Peter Rice

Oryza Bioinformatics Ltd.



Tuesday July 21, 2020 12:50 - 12:55 EDT
Joint

12:55 EDT

ViPRA-Haplo: de novo reconstruction of viral populations using paired end sequencing data 🍐
Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).
You are welcome to visit Poster P7-01 for more details.

Weiling Li 1, Raunaq Malhotra 2, Steven Wu 3, Manjari Jha 4, Allen Rodrigo 5, Mary Poss 6, and Raj Acharya 7

1 Indiana University, Bloomington, IN. Email: wli6@iu.edu
2 GNS Healthcare, Cambridge, MA. Email: rmalhotra@gnshealthcare.com
3 BioConsortia, Davis, CA. Email: stevenhwu@gmail.com
4 Microsoft, Redmond, WA. Email: manjari.mu@gmail.com
5 The University of Auckland, Auckland, New Zealand. Email: a.rodrigo@auckland.ac.nz
6 The Pennsylvania State University, University Park, PA. Email: maryposs@gmail.com
7 Indiana University, Bloomington, IN. Email: racharya@iu.edu

Project Website: https://github.com/raunaq-m/MLEHaplo
Source Code: https://github.com/raunaq-m/MLEHaplo
License: BSD 2-Clause License


Main Text of Abstract

Viruses replicating within a host exist as a collection of closely related genetic variants known as viral haplotypes. The diversity in a viral population, or quasispecies, is due to mutations (insertions, deletions or substitutions) or recombination events that occur during virus replication. These haplotypes differ in relative frequencies and together play an important role in the fitness and evolution of the viral population. This variation in viral sequences poses a challenge to vaccine design and drug development.

We present ViPRA-Haplo, a de novo assembly algorithm for reconstructing viral haplotypes in a virus population from paired-end next generation sequencing (NGS) data. The proposed Viral Path Reconstruction Algorithm (ViPRA) generates a subset of paths from a De Bruijn graph of reads using the pairing information of reads. These paths represent contigs of the virus, but they over-estimate the set of possible contigs. We therefore propose two methods to obtain an optimal set of contigs representing the viral haplotypes. The first method uses VSEARCH to cluster the paths reconstructed by ViPRA; the centroid of each cluster represents a contig. The second method, MLEHaplo, generates a maximum likelihood estimate of the viral population using the ViPRA paths.

We evaluate and compare ViPRA-Haplo on a simulated data set, on a real HIV MiSeq data set (SRR961514) with sequencing errors, and on an emerging real SARS-CoV-2 data set (SRR10903401). On the simulated data, ViPRA-Haplo reconstructs full-length viral haplotypes with 99.7% sequence identity to the true viral haplotypes at 250x sequencing coverage. On the real NGS data, the error correction software Karect is used to improve de novo assembly. The real HIV data set contains 714,994 pairs (2x250 bp) of reads that cover the five strains to 20,000x. Our method reconstructs contigs that cover over 90% of each strain of the reference genomes, which is higher than the benchmark tool PEHaplo. On the SARS-CoV-2 data, after filtering for SARS-CoV-2 contigs using the metagenomic classifier Centrifuge, the contigs reconstructed by our method cover over 99% of the reference genome. The comparisons on both simulated and real data show that ViPRA-Haplo outperforms the existing tools, both in coverage of the reference genome(s) and in retaining the variation in viral sequence naturally present in the viral population.
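The De Bruijn graph of reads that the abstract takes as its starting point can be sketched as follows. This is a generic textbook construction and not the ViPRA code: the function name `de_bruijn_graph`, the toy reads, and the choice k=3 are hypothetical, and the sketch ignores the read-pairing information that ViPRA uses to select paths.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    Repeated edges are kept, so the list length reflects k-mer multiplicity.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# Two hypothetical overlapping reads.
reads = ["ACGTG", "CGTGA"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Paths through such a graph correspond to candidate contigs; closely related haplotypes show up as branches, which is why a further selection step such as clustering or likelihood estimation is needed.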

Speakers
Weiling Li

postdoc, Indiana University - Bloomington



Tuesday July 21, 2020 12:55 - 13:00 EDT
Joint

13:00 EDT

Q&A
The presenter(s) will be available for live Q&A in this session.

Moderators
Tuesday July 21, 2020 13:00 - 13:05 EDT
Joint

13:05 EDT

COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Andrea Guarracino 5, Peter Amstutz 2, Thomas Liener 3, Michael Crusoe 4, Adam Novak 6, Erik Garrison 6, Tazro Ohta 7, Bonface Munyoki 1, Danielle Welter 8, Sarah Zaranek 2, Alexander (Sasha) Wait Zaranek 2, Pjotr Prins 1

1 Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science
Center, Memphis, TN, USA.
2 Curii Corporation, Boston, MA, USA.
3 independent.
4 Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam, The
Netherlands.
5 Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata,
Rome, Italy.
6 UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
7 Database Center for Life Sciences, Tokyo, Japan.
8 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg.

Project Website: http://covid19.genenetwork.org/
Source Code: https://github.com/arvados/bh20-seq-resource
License: Apache 2.0

As part of the COVID-19 Virtual Biohackathon 2020 we formed a working group to create a
COVID-19 Public Sequence Resource (COVID-19 PubSeq) for SARS-CoV-2 virus sequences. Our goal
was to create a repository that had a low barrier to entry for uploading and analyzing sequence
data. We followed FAIR data practices: data are published under a public domain (CC0) or Creative
Commons Attribution 4.0 (CC-BY-4.0) license, structured metadata is validated against standard ontologies,
and, most importantly, reproducible workflows are executed after the upload in order to provide
up-to-date results rapidly and in standardized data formats.

Existing data repositories for viral data include GISAID, EBI ENA and NCBI. These repositories allow
free sharing of data, but do not enforce strict quality control on submitted data or metadata, and
do not add value in terms of running additional analysis. In addition, some databases have a
restricted license which prevents data from being used in online web services and on-the-fly
computation, hindering research.

We created a prototype sequence resource within one week by leveraging existing technologies,
such as the Arvados Cloud platform (http://arvados.org), Common Workflow Language (CWL)
(http://commonwl.org), and the many free and open source software packages that are available
for bioinformatics. Pipelines developed by several teams were combined into an omnibus
pangenome analysis workflow. Computing resources for this project were generously donated by
Amazon Web Services.

Speakers
Andrea Guarracino

PhD student, Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata, Rome, Italy.



Tuesday July 21, 2020 13:05 - 13:10 EDT
Joint

13:10 EDT

Designing and executing workflows for virtual screening of the SARS-CoV-2 main protease 🍐
→ Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Simon Bray 1, Tim Dudgeon 2, Björn Grüning 3

1 Bioinformatics Group, University of Freiburg. Email: sbray@informatik.uni-freiburg.de
2 Informatics Matters, Oxford.
3 Bioinformatics Group, University of Freiburg.

Project Website: https://cheminformatics.usegalaxy.eu (webserver);
https://covid19.galaxyproject.org/cheminformatics (project documentation)
Source Code: https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox;
https://github.com/galaxycomputationalchemistry/galaxy-tools-compchem
License: Apache 2.0 (tools); MIT (documentation)

In silico analysis plays a vital role in drug discovery. Computational work allows
simulation and study of potential drug candidates on a scale that is impossible
experimentally. There are many different steps in virtual screening, each with a
different use-case, accuracy and computational cost. Thus, tools are often linked
into workflows, starting with low-accuracy, low-cost methods and filtering outputs
before applying high-accuracy, high-cost methods. The problem of how to link
multiple tools into a single workflow lends itself ideally to workflow management
systems such as Galaxy or Nextflow. There is already a wide range (over 100) of
cheminformatics tools available in Galaxy under the aegis of the ChemicalToolbox
(Bray et al., 2020, submitted, available at https://cheminformatics.usegalaxy.eu).
These draw on a range of open-source cheminformatics (OpenBabel, RDKit),
molecular dynamics (GROMACS) and molecular docking (rDock, AutoDock Vina)
libraries.

Here, we present a case study in which the ChemicalToolbox was used to assemble
and run workflows for virtual screening of the SARS-CoV-2 main protease (Mpro).
Mpro is one of the main druggable targets for the SARS-CoV-2 virus, and in March
2020 the Diamond Light Source released the results of a crystallographic fragment
screen, providing crystal structures of Mpro in complex with over 50 small organic
molecules. These were used to calculate a list of ~40k candidate molecules and
Galaxy workflows were constructed for charge enumeration, 3D conformer
generation, docking into the Mpro active site with rDock, pose scoring with the
TransFS deep learning approach, and validation against the experimental
structures using the SuCOS structural similarity measure. Running these
workflows led to selection of a shortlist of 500 most promising molecules, to be
purchased for further in vitro study. During the course of the study, over 25
CPU/GPU years, provided by a Europe-wide compute infrastructure network, were
used to generate and score ~80 million docking poses.

In this presentation, the ChemicalToolbox will be briefly introduced, followed by
discussion of the workflow and results.

Speakers
Simon Bray

University of Freiburg
I'm a member of the European Galaxy Team at the University of Freiburg, interested in computational chemistry, molecular dynamics, and the use of workflow management systems for virtual screening.



Tuesday July 21, 2020 13:10 - 13:25 EDT
Joint

13:25 EDT

Analyzing the Nanopore data of SARS-CoV-2 within the Galaxy framework 🌀
➞ Abstract

Nathan Roach 1, Milad Miladi 2, Florian Heyl 2, Stephan Flemming 2, Bjoern Grüning 2, Anton Nekrutenko 3

  1. Department of Biology, Johns Hopkins University, Baltimore MD, USA.
  2. Department of Computer Science, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany.
  3. Department of Biochemistry and Molecular Biology, Penn State University, University Park PA, USA.

The presenter(s) will be available for live Q&A at the end of this session (BCC West).

Speakers
Nathan Roach

Computational Biologist, Galaxy Works



Tuesday July 21, 2020 13:25 - 13:40 EDT
Joint

13:40 EDT

Q&A
The presenter(s) will be available for live Q&A in this session.

Moderators
Tuesday July 21, 2020 13:40 - 13:45 EDT
Joint

14:00 EDT

West Keynote 3: Biased by Default: Exploring Discrimination in Research Code
Abigail Cabunoc Mayes, Mozilla Foundation

This keynote will be delivered live.

Abstract

Artificial intelligence and machine learning have unlocked countless insights and data-driven results within the research community. At the same time, we’ve seen how AI contributes to diversity problems we face today, even amplifying bias within the scientific community.

In her talk, Abby will explore the benefits, harms and side effects AI has on researchers and society at large. Scientists have been at the forefront of ethics and tech with each new discovery, from the atomic bomb to personal genomes. The research community has an opportunity to act now: AI needs to be designed with the well-being of humans – and humanity – in mind.


This session will be introduced by Yo Yehudi.

Speakers
Abigail Cabunoc Mayes

Mozilla Foundation
Abigail Cabunoc Mayes is the Working Open Practice Lead at the Mozilla Foundation. Abby mobilizes leaders in the Internet health movement through mentorship and training on open practices and open source. Before this, she was Lead Developer of the Mozilla Science Lab, transformin...


Tuesday July 21, 2020 14:00 - 14:45 EDT
Joint

14:45 EDT

Closing remarks
Conference wrap up. Thank you for attending BCC2020!
Nomi's slides are here: https://docs.google.com/presentation/d/1_JzVnpm-9RnOd8iwknbOIsS3wWrRAxo3/edit#slide=id.p1

Moderators
Dave Clements

Training and Outreach Coordinator, Galaxy Project, Johns Hopkins University

Nomi Harris

BOSC Chair, LBNL
This is my 10th year chairing or co-chairing BOSC, the Bioinformatics Open Source Conference. In 2020, BOSC is part of the online Bioinformatics Community Conference, BCC2020.

Speakers
Peter Cock

The James Hutton Institute
Bioinformatician at the James Hutton Institute, a member of the BOSC organizing committee, treasurer of the Open Bioinformatics Foundation, and a core developer on the Biopython project.

Frederik Coppens

VIB-UGent Center for Plant Systems Biology



Tuesday July 21, 2020 14:45 - 15:00 EDT
Joint
 