Abstract
The presenter(s) will be available for live Q&A in this session (BCC West).
Maarten JMF Reijnders 1,2 and Robert M. Waterhouse 1,2
1 University of Lausanne, Lausanne, Switzerland.
2 Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Email: maarten.reijnders@unil.ch
Source code:
https://gitlab.com/mreijnders/CrowdGO
License: GNU General Public License v3.0
Methods to predict protein functions, defined here as assigning Gene Ontology (GO) terms,
vary considerably in their underlying approach, with different methods employing techniques
such as sequence homology, machine learning, or text mining. This often results in dramatically
different sets of GO terms predicted for the same sets of proteins. These methods are reviewed
in the Critical Assessment of Functional Annotation (CAFA) competitions (Zhou 2019), but even
the best-scoring methods can be inaccurate, and none truly stands out. To concurrently exploit
the strengths of each method, we developed a meta-predictor that evaluates the predictions of
multiple top-performing methods.
CrowdGO compares the predictions of different methods and uses a machine learning model to
improve the precision, recall, and F-max scores of the resulting meta-predictions. The model can
be trained based on user-selected prediction methods, or a pre-trained model can be used. The
pre-trained models are built using prediction tools that are exclusively open-source, easy to use,
and computationally non-demanding. CrowdGO includes Snakemake workflows to use existing
models for GO term prediction, or to train new models.
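As a rough illustration of how predictions from several methods can be compared before a model scores them, the sketch below collects one score per method for every (protein, GO term) pair. It is a minimal sketch under assumptions, not CrowdGO's actual implementation: the method names, the input layout, and the `build_feature_table` helper are hypothetical, and the abstract does not specify which features or model type CrowdGO uses.

```python
from collections import defaultdict

# Hypothetical method names, for illustration only.
METHODS = ["method_a", "method_b", "method_c"]


def build_feature_table(predictions):
    """Map each (protein, GO term) pair to one score per method.

    `predictions` maps a method name to a list of
    (protein, go_term, score) tuples. A method that did not
    predict a given pair contributes a score of 0.0.
    """
    table = defaultdict(lambda: [0.0] * len(METHODS))
    for col, method in enumerate(METHODS):
        for protein, go_term, score in predictions.get(method, []):
            row = table[(protein, go_term)]
            # Keep the highest score if a method reports a pair twice.
            row[col] = max(row[col], score)
    return dict(table)


# Made-up scores for two proteins and three GO terms.
predictions = {
    "method_a": [("P1", "GO:0003677", 0.9), ("P1", "GO:0005634", 0.4)],
    "method_b": [("P1", "GO:0003677", 0.7)],
    "method_c": [("P1", "GO:0005634", 0.8), ("P2", "GO:0016301", 0.6)],
}

features = build_feature_table(predictions)
for pair, scores in sorted(features.items()):
    print(pair, scores)
```

Each row could then serve as the input vector for a classifier trained to decide whether the GO term is a correct annotation for the protein, so that agreement and disagreement between methods become signals the model can learn from.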
Using a model built with four input predictions from a sequence homology-based predictor, Wei2GO (Reijnders 2020), two protein domain-based predictors, InterProScan (Mitchell 2019) and FunFams (Scheibenreif 2019), and a deep learning predictor, DeepGOPlus (Kulmanov 2019), CrowdGO increases both precision and meaningful recall compared to each input method (Figure 1).
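For readers unfamiliar with the metrics, the following sketch shows how precision, recall, and the protein-centric F-max are commonly computed in CAFA-style evaluations: precision is averaged over proteins with at least one prediction at a given score threshold, recall is averaged over all benchmark proteins, and F-max is the best harmonic mean across thresholds. It is a simplified sketch, assuming scores in [0, 1] and omitting the propagation of GO terms to their ancestors; the `f_max` function and the example data are hypothetical and are not taken from CrowdGO.

```python
def f_max(predicted, truth, thresholds=None):
    """Protein-centric F-max.

    predicted: protein -> {go_term: score in [0, 1]}
    truth:     protein -> non-empty set of true GO terms
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predicted.get(protein, {})
            called = {term for term, s in scores.items() if s >= t}
            hits = len(called & true_terms)
            # Precision only counts proteins with a prediction at t.
            if called:
                precisions.append(hits / len(called))
            # Recall counts every benchmark protein.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
    return best


# Made-up benchmark annotations and predictions.
truth = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0016301"},
}
predicted = {
    "P1": {"GO:0003677": 0.9, "GO:0005634": 0.3, "GO:0005737": 0.5},
    "P2": {"GO:0016301": 0.6},
}

print(round(f_max(predicted, truth), 3))
```

Because the score is maximized over thresholds, a method is not penalized for how its scores are calibrated, only for how well they rank correct terms above incorrect ones.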
CrowdGO is fully open source and leverages other open-source tools. It is straightforward to use, owing both to the simplicity of the software and to the accompanying Snakemake pipelines. Because its meta-prediction approach can incorporate new input methods, it will stay relevant even when improved function prediction software becomes available.