Abstract
The presenter(s) will be available for live Q&A in this session (BCC West).
Nils Hoffmann 1, Dominik Kopczynski 1, Bing Peng 2, Robert Ahrends 3
1 Leibniz-Institut für Analytische Wissenschaften ISAS e.V., Otto-Hahn-Straße 6b, 44227
Dortmund, Germany. Email: nils.hoffmann@isas.de
2 Karolinska Institutet, Solna, Stockholm, Sweden.
3 Department of Analytical Chemistry, University of Vienna, Vienna, Austria.
Project Website:
https://lifs.isas.de/goslin & https://apps.lifs.isas.de/goslin
Source Code:
https://github.com/lifs-tools/goslin (main hub to implementations)
License:
Apache v2 License & MIT License
Main Text of Abstract
We introduce the 'Grammar of Succinct Lipid Nomenclature' (Goslin), a polyglot grammar for
common lipid shorthand nomenclatures based on the LipidMaps nomenclature and the shorthand
nomenclature established by Liebisch et al. and used by LipidHome and SwissLipids, accompanied
by parser implementations in C++, Java, Python and R.
Lipid naming has evolved into several dialects, which complicates the unified computational
treatment and parsing of lipid names. As a consequence, lengthy and error-prone manual curation
is often necessary to harmonize lists of lipid names before they can be processed in follow-up
analysis scripts, workflows, or tools, or submitted to research data repositories. Goslin
was designed to address two pressing issues in the lipidomics field: 1) to
simplify the implementation of lipid name handling for developers of mass spectrometry-based
lipidomics tools; and 2) to offer a tool that unifies and normalizes the main existing lipid name
dialects, enabling high-throughput lipidomics analysis.
Goslin and its parser implementations are thus designed to act as a library for the development of
lipidomics tools, providing a standardized data structure for storing structural lipid information.
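To make the idea of a standardized data structure concrete, the following is a minimal sketch in Python. It is not the data model of the Goslin libraries: the class names `FattyAcyl` and `Lipid`, their fields, and the methods `molecular_name` and `species_name` are hypothetical and chosen only for illustration. The sketch assumes the shorthand convention in which `/` separates chains with known sn-positions, `_` separates chains with unknown positions, and the species level reports summed carbons and double bonds.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FattyAcyl:
    """One fatty acyl chain: carbon count and number of double bonds."""
    carbons: int
    double_bonds: int

    def __str__(self) -> str:
        return f"{self.carbons}:{self.double_bonds}"


@dataclass(frozen=True)
class Lipid:
    """Hypothetical container for illustration; not the Goslin data model."""
    head_group: str
    chains: Tuple[FattyAcyl, ...]
    sn_positions_known: bool = False

    def molecular_name(self) -> str:
        # '/' when sn-positions are known, '_' when they are not.
        separator = "/" if self.sn_positions_known else "_"
        return f"{self.head_group} " + separator.join(str(c) for c in self.chains)

    def species_name(self) -> str:
        # Species level: only the summed chain composition is reported.
        carbons = sum(c.carbons for c in self.chains)
        double_bonds = sum(c.double_bonds for c in self.chains)
        return f"{self.head_group} {carbons}:{double_bonds}"


lipid = Lipid("PC", (FattyAcyl(16, 0), FattyAcyl(18, 1)), sn_positions_known=True)
print(lipid.molecular_name())  # PC 16:0/18:1
print(lipid.species_name())    # PC 34:1
```

Storing the structure once and rendering names from it is what would allow a tool to emit the same lipid at different levels of detail without re-parsing strings.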
Parsing and generating lipid names are the main functions of Goslin. We
therefore defined a context-free grammar (with ANTLR4) whose rules and productions cover all
structural properties of the lipid nomenclature, including mass spectrometry-specific information
about unlabeled and heavy isotope-labeled species, as well as fragments and adducts. We recently
added the calculation of masses and sum formulas for cases where the head group's sum composition
is known. Currently, the grammar covers 289 lipid classes within the seven most common lipid
categories in eukaryotic organisms, namely fatty acyls, glycerolipids, glycerophospholipids,
saccharolipids, sphingolipids, sterol lipids, and polyketides. The major advantages of using a
grammar rather than a manually coded parser are its flexibility and extensibility. Regular
expressions are likewise unsuitable for parsing lipid names: they can only recognize regular
languages and therefore cannot recognize nested patterns.
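The point about nested patterns can be illustrated on a small example that is deliberately not Goslin's grammar and does not use ANTLR4. The sketch below assumes a toy context-free grammar for bracketed sum formulas, `formula := group+` and `group := (ELEMENT | '(' formula ')') COUNT?`, and parses it by recursive descent. Because a group may contain a whole formula, brackets can nest to any depth, which a regular expression in the classical sense cannot track. The function names `parse_formula` and `monoisotopic_mass` are hypothetical, and the mass table is a small assumed subset with values rounded to six decimals.

```python
from collections import Counter

# Assumed subset of monoisotopic masses (Da), rounded to six decimals.
MONOISOTOPIC = {
    "C": 12.000000,
    "H": 1.007825,
    "N": 14.003074,
    "O": 15.994915,
    "P": 30.973762,
}


def parse_formula(text: str) -> Counter:
    """Parse a bracketed sum formula such as 'C3H5(OH)3' into element counts."""
    pos = 0

    def read_count() -> int:
        # Optional multiplier after an element or a closing bracket.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return int(text[start:pos]) if pos > start else 1

    def formula() -> Counter:
        nonlocal pos
        total = Counter()
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                pos += 1
                inner = formula()  # recursion is what handles nesting
                if pos >= len(text) or text[pos] != ")":
                    raise ValueError("missing ')'")
                pos += 1
            elif text[pos].isupper():
                start = pos
                pos += 1
                while pos < len(text) and text[pos].islower():
                    pos += 1
                inner = Counter({text[start:pos]: 1})
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
            multiplier = read_count()
            for element, count in inner.items():
                total[element] += count * multiplier
        return total

    result = formula()
    if pos != len(text):
        raise ValueError("unbalanced ')'")
    return result


def monoisotopic_mass(counts: Counter) -> float:
    """Sum element masses; raises KeyError for elements not in the table."""
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())


glycerol = parse_formula("C3H5(OH)3")
print(dict(glycerol))                         # {'C': 3, 'H': 8, 'O': 3}
print(round(monoisotopic_mass(glycerol), 4))  # 92.0473
```

A grammar-based parser generator derives this kind of recursive structure from the grammar rules, so extending the accepted language means editing rules rather than hand-written control flow.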
We provide implementations of Goslin in four major programming languages, namely C++, Java,
Python 3, and R, to kick-start adoption and integration. Further, we set up a web service for users to
work with Goslin directly and via an OpenAPI-compliant REST API. All implementations are
available free of charge under a permissive open source license; binary releases are available from
Zenodo. We are currently working on making the libraries available via BioConda/BioContainers
and other community-facing repositories.