The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) has produced a set of high-quality analysis pipelines that are used by the ENCODE Consortium and have been released to the community. The pipelines are described with the Workflow Description Language (WDL) and use containerization to enhance reproducibility. To further increase the usability and dissemination of these pipelines, we have developed a web interface on Truwl (https://truwl.com/) for specifying parameters and inputs for the ENCODE ATAC-seq pipeline. The pipeline can be executed directly from the web interface on Google Cloud Platform (GCP). Once compute jobs are successfully executed, the analysis is posted back to Truwl to allow others to view the parameters, inputs, and outputs of previously executed pipelines. Automatically posting previously executed jobs increases the transparency of computational experiments and provides examples for others to follow. All content on Truwl is open-access, web-searchable, and has unique identifiers, making it easy to find and easy to share. In this software demonstration we will show the ATAC-seq pipeline running from Truwl, both by specifying the parameters and inputs individually in the web interface and by reusing a previously posted analysis.
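WDL pipelines like the ENCODE ATAC-seq pipeline take their parameters from an inputs JSON file, which is essentially what a web form like Truwl's assembles for the user. The sketch below builds such a file in Python; the "atac.*" key names follow the pipeline's input naming convention, but the specific keys and the bucket paths shown are illustrative assumptions, not an authoritative input specification.

```python
import json

# Illustrative inputs for the ENCODE ATAC-seq WDL pipeline.
# Key names follow the pipeline's "atac.*" convention, but the exact
# keys and the gs:// paths here are hypothetical placeholders.
inputs = {
    "atac.title": "Example ATAC-seq run",
    "atac.paired_end": True,
    "atac.fastqs_rep1_R1": ["gs://my-bucket/rep1_R1.fastq.gz"],
    "atac.fastqs_rep1_R2": ["gs://my-bucket/rep1_R2.fastq.gz"],
    "atac.genome_tsv": "gs://my-bucket/hg38.tsv",
}

# Write the inputs file that a WDL runner (e.g. Cromwell) would consume.
with open("atac_inputs.json", "w") as fh:
    json.dump(inputs, fh, indent=2)
```

A web interface that posts executed jobs back, as Truwl does, is in effect sharing this inputs file alongside the outputs, which is what makes a previous analysis reusable.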
Advances in high throughput sequencing have increased the need for tools that aid in data storage,
analysis, annotation, and visualization. Many such tools are available, but their usability and
accessibility vary. To make essential tools more accessible, the bioinformatics community has
coalesced around the idea of using cloud-based platforms to provide access to computational power
and data storage resources. CyVerse is a multi-institution project focused on supporting life science
research by providing user-friendly access to national cyberinfrastructure resources, including HPC
clusters and storage infrastructure. As part of this effort, CyVerse developers built the Terrain
Application Programmer Interfaces (APIs), which offer programmatic access to these resources.
One important limitation of the CyVerse ecosystem, however, is that there is currently no easy way
for researchers to visualize genomic data sets stored in CyVerse accounts. This is problematic
because visualization is essential for all aspects of data analysis, from validating the output of
algorithms to detecting biologically meaningful patterns in data.
BioViz Connect solves this problem by connecting CyVerse resources to Integrated Genome
Browser, a full-featured, open-source visualization tool for genomics used by thousands of
researchers worldwide. BioViz Connect uses Terrain APIs to forward data from CyVerse into IGB.
The BioViz Connect interface (Figure 1) lets users annotate data files with key metadata, notably
the version of reference genomes used in data analysis. Users can also run compute-intensive visual
analytics tasks and then display the results in IGB. To our knowledge, no other group has yet
experimented with using Terrain for application development outside of the CyVerse team.
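Terrain is a REST API, so an application like BioViz Connect accesses CyVerse data by sending authenticated HTTP requests. The sketch below constructs (without sending) such a request using only the standard library; the endpoint path shown is illustrative of how Terrain's secured filesystem routes are addressed, and the exact routes and parameters should be taken from the Terrain API documentation rather than from this example.

```python
import urllib.request

# Base URL of CyVerse's public Terrain deployment.
TERRAIN_BASE = "https://de.cyverse.org/terrain"

def build_listing_request(token: str, folder: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated Terrain request.

    The endpoint path below is a hypothetical example of a secured
    filesystem route; Terrain authenticates callers with a bearer token.
    """
    url = f"{TERRAIN_BASE}/secured/filesystem/directory?path={folder}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
```

This is the pattern a tool built on Terrain follows: obtain a token, attach it to each request, and translate the JSON responses into application-level objects (here, files to forward into IGB).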
Matúš Kalaš 1, Hervé Ménager 2, Alban Gaignard 3, Veit Schwämmle 4, Jon Ison 5, and the EDAM contributors and advisors
1. University of Bergen, Norway
2. Institut Pasteur, Paris, France
3. University of Nantes, France
4. University of Southern Denmark, Odense, Denmark
5. French Institute of Bioinformatics (ELIXIR France)
Project website: https://edamontology.org
Source code: https://github.com/edamontology/edamontology
License: CC BY-SA 4.0
EDAM is an ontology of well-established, familiar concepts that are prevalent within bioinformatics, and bioscientific data analysis in general [1,2]. The scope of EDAM includes types of data and data identifiers, data formats, operations, and topics. EDAM has a relatively simple structure, and comprises a set of concepts with terms, synonyms, definitions, relations, links, persistent identifiers, and some additional information (especially for data formats).
EDAM is developed in a participatory and transparent fashion, within a growing international community of contributors. The development of EDAM is coordinated with the development and curation of tools registries (e.g. bio.tools and BIII.eu); registries of training materials (e.g. TeSS); with packaging of open-source bioinformatics software (especially Debian Med [3]); the Common Workflow Language [4]; and other related communities and initiatives. These include the developers’ community of Galaxy [5], and collaborations with specialised networks of experts, such as within the development of EDAM-bioimaging [6]. EDAM-bioimaging is an extension of EDAM towards bioimage informatics and machine learning, where a broad group of experts in bioimaging, image analysis, and deep learning has been contributing to the common effort. The comprehensive but concise inclusion of machine learning topics is one of the new additions in 2020. The latest release of EDAM at the time of publication was version 1.24 [7], and EDAM-bioimaging version alpha06 [8].
In summary, EDAM functions as a common controlled vocabulary when publishing, sharing, and integrating information about bioinformatics tools, workflows, training materials, and other resources. EDAM is also useful when choosing terminology, for data provenance, and in text mining (e.g. EDAMmap).
Poster published in F1000Research on 6 Jun 2020. https://doi.org/10.7490/f1000research.1117983.1
Video presentation: https://youtu.be/Jq16bnq8kbk
Viruses replicating within a host exist as a collection of closely related genetic variants known as viral haplotypes. The diversity in a viral population, or quasispecies, is due to mutations (insertions, deletions, or substitutions) or recombination events that occur during virus replication. These haplotypes differ in relative frequency and together play an important role in the fitness and evolution of the viral population. This variation in viral sequences poses a challenge to vaccine design and drug development. We present ViPRA-Haplo, a de novo assembly algorithm for reconstructing viral haplotypes in a virus population from paired-end next-generation sequencing (NGS) data. The proposed Viral Path Reconstruction Algorithm (ViPRA) generates a subset of paths from a De Bruijn graph of reads using the pairing information of reads. These paths represent contigs of the virus, and they over-estimate the set of possible contigs. We therefore propose two methods to obtain an optimal set of contigs representing the viral haplotypes. The first method uses VSEARCH to cluster the paths reconstructed by ViPRA, with the centroid of each cluster representing a contig. The second method, MLEHaplo, generates a maximum likelihood estimate of the viral population from the ViPRA paths. We evaluate and compare ViPRA-Haplo on a simulated data set, on a real HIV MiSeq data set (SRR961514) with sequencing errors, and on an emerging SARS-CoV-2 real data set (SRR10903401). On the simulated data, ViPRA-Haplo reconstructs full-length viral haplotypes with 99.7% sequence identity to the true viral haplotypes at 250x sequencing coverage. On the real NGS data, the error correction software Karect is used to improve de novo assembly. The real HIV data set contains 714,994 read pairs (2x250 bp) that cover the five strains at 20,000x.
Our method can reconstruct contigs that cover over 90% of each strain of the reference genomes, which is higher than the benchmark tool PEHaplo. In the SARS-CoV-2 data, after filtering for SARS-CoV-2 contigs using the metagenomic classifier Centrifuge, the contigs reconstructed by our method cover over 99% of the reference genome. The comparisons on both simulated and real data show that ViPRA-Haplo outperforms the existing tools by a higher coverage in reference genome(s), and in retaining the variation in viral sequence present naturally in the viral population.
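The De Bruijn graph at the core of ViPRA can be illustrated with a minimal sketch: nodes are (k-1)-mers, edges are k-mers observed in reads, and contigs correspond to paths through the graph. This is only the generic construction the abstract builds on; ViPRA's actual path selection additionally uses read-pairing information, which is not shown here.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    A minimal sketch of the graph underlying ViPRA-style assembly; the
    real algorithm also scores paths using paired-read information.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge prefix -> suffix
    return graph

def walk(graph, start):
    """Greedily consume unvisited edges from `start`, yielding one contig."""
    contig, node = start, start
    while graph.get(node):
        node = graph[node].pop(0)
        contig += node[-1]  # each edge extends the contig by one base
    return contig
```

For example, the overlapping reads "ATGGC" and "TGGCA" with k=3 yield a single path that spells out the merged sequence "ATGGCA"; in a real quasispecies data set, branches in the graph correspond to variant sites, which is why path enumeration over-estimates the contig set.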
Human pluripotent stem cells, derived from embryos or fetal tissue, are providing new opportunities to understand changes in gene regulation. Here we introduce a visualization tool that can be used to investigate how the physical locations of genes on a chromosome relate to significant changes in gene expression across cell types. We present the results of an experiment in which induced pluripotent stem cells (iPSCs) were generated from human umbilical vein endothelial cells (HUVEC), and then differentiated back into endothelial cells (EC-Diff) as well as into neuronal cells (Nn-Diff). Our tool encodes significant changes in gene expression and allows for investigation of genes by their ontological classification and physical location. Observing the relationship between location, gene ontology, and expression level across cell types can assist in the identification of patterns in gene regulation changes. This novel tool can shed light on why stem cells differentiate into one cell type over another, with applications for modeling and treatment in the realms of neurodegeneration and cardiovascular disease. Our results have the potential to bridge the gap between complicated datasets resulting from experiments on cells, and biologists with the domain knowledge to properly draw conclusions about pathway activation.
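A tool of this kind rests on a simple data model: each gene carries a chromosomal location, an ontology class, and a per-comparison expression change. The sketch below is a hypothetical version of that model; the field names, gene coordinates, and thresholds are illustrative assumptions, not the tool's actual schema or data.

```python
# Hypothetical records: gene, chromosome, position, ontology class,
# and a log2 fold change with p-value for one cell-type comparison.
# All values below are illustrative placeholders.
genes = [
    {"gene": "NOTCH1", "chrom": "chr9", "pos": 136494433,
     "go": "signaling", "log2fc": 2.1, "p": 0.001},
    {"gene": "GAPDH", "chrom": "chr12", "pos": 6534512,
     "go": "metabolism", "log2fc": 0.1, "p": 0.8},
    {"gene": "SOX2", "chrom": "chr3", "pos": 181711925,
     "go": "transcription", "log2fc": -3.0, "p": 0.0004},
]

def significant_by_chromosome(records, fc=1.0, alpha=0.05):
    """Group significantly changed genes by chromosome, ordered by position.

    This is the grouping a location-aware visualization would render:
    only genes passing both a fold-change and a significance cutoff.
    """
    hits = [g for g in records if abs(g["log2fc"]) >= fc and g["p"] < alpha]
    by_chrom = {}
    for g in sorted(hits, key=lambda g: (g["chrom"], g["pos"])):
        by_chrom.setdefault(g["chrom"], []).append(g["gene"])
    return by_chrom
```

Laying the surviving genes out by chromosome and position, and coloring them by ontology class, is what lets spatial patterns in regulation changes become visible.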
Understanding DNA sequences has been an ongoing endeavour within bioinformatics research. Recognizing the functionality of DNA sequences is a non-trivial and complex task that can bring insights into understanding DNA. This project explored deep learning models for recognizing gene-regulating regions of DNA - more specifically, promoters. We implemented current models from the literature to replicate their results and to explore how the models might be recognizing promoters. Work in this field typically offers a web application for the community to use, where one can submit limited data to obtain the model's result; this has become the standard in the field, and authors rarely provide the source code of their work, which creates unnecessary obstacles for new research.
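The abstract does not state how the models encode their input, but most promoter-recognition deep learning models in this literature feed sequences in as one-hot vectors, so a plausible sketch of that preprocessing step looks like this:

```python
def one_hot(seq):
    """One-hot encode a DNA sequence: A, C, G, T -> 4-dim unit vectors.

    A common input representation for sequence-based DL models; ambiguous
    bases (e.g. N) map to the zero vector here, which is one of several
    conventions and is an assumption of this sketch.
    """
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

A convolutional model then slides filters over this length-by-4 matrix, which is one reason promoter motifs at fixed offsets in curated datasets can be learned, and possibly over-fit, so easily.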
Previous work has also relied on limited curated datasets to both train and evaluate models using cross-validation, obtaining high-performing results across a variety of metrics. We implemented various models from the literature and compared them against each other, using their datasets interchangeably throughout the comparison tests. These comparisons highlight shortcomings in the training and testing datasets for these models, prompting us to create a robust promoter-recognition testing dataset and to develop a methodology that generates a wide variety of testing datasets for promoter recognition.
It is then possible to test and analyse the models from the literature with the newly created datasets, which provides a standard benchmark that mimics a realistic scenario. To avoid replicability and model comparability issues in the future, we open-source our findings and testing methodology. New deep learning (DL) models can be implemented as PyTorch modules, while other machine learning (ML) models can be implemented using sklearn. Both can be trained and tested using sklearn's procedures, where DL models can make use of skorch as a wrapper around PyTorch with an added sklearn interface. Training and testing scripts for DL models have been added as examples and can be expanded by the open-source community. While we focus on DL models in this project, our training and testing scripts are also applicable to other ML models.
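The reason skorch makes PyTorch and sklearn models interchangeable in shared scripts is the fit/predict interface contract. The sketch below illustrates that contract with a hypothetical pure-Python baseline model, standing in for either a skorch-wrapped network or an sklearn estimator; it is not part of the project's codebase.

```python
class MajorityClassBaseline:
    """Hypothetical stand-in model exposing the sklearn fit/predict API.

    Any object with this interface - an sklearn estimator, or a PyTorch
    module wrapped in skorch's NeuralNetClassifier - can be driven by
    the same training and testing scripts.
    """

    def fit(self, X, y):
        # "Train" by memorizing the most frequent label.
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

def accuracy(model, X, y):
    """Generic evaluation loop: depends only on the predict interface."""
    preds = model.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)
```

Because the benchmark scripts only call fit and predict, swapping a deep model in or out requires no changes to the testing methodology itself.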