BCC2020 has ended
Registration closed July 15.

BCC2020 is online, global, and affordable. The meeting and training are now done, and the CoFest is under way.

The 2020 Bioinformatics Community Conference brings together the Bioinformatics Open Source Conference (BOSC) and the Galaxy Community Conference into a single event featuring training, a meeting, and a CollaborationFest. Events run from July 17 through July 25 and are held in both the eastern and western hemispheres.

Tuesday, July 21 • 02:01 - 02:15
Shesmu: A bioinformatics orchestration tool 🍐




Abstract


The presenter(s) will be available for live Q&A in this session (BCC West).

Andre P. Masella 1, Heather E. Armstrong 2, Iain Bancarz 2, Dillan J. Cooke 2, Michael Laszloffy 2,
Angie Mosquera 2, Alexis V. Varsava 2, Morgan Taschuk 3

1 Ontario Institute for Cancer Research, Toronto, Canada. Email: andre.masella@oicr.on.ca
2 Ontario Institute for Cancer Research, Toronto, Canada.
3 Ontario Institute for Cancer Research, Toronto, Canada. Email: morgan.taschuk@oicr.on.ca

Project Website: https://oicr-gsi.github.io/shesmu
Source Code: https://github.com/oicr-gsi/shesmu
License: MIT License


Main Text of Abstract

In the ten years that the Genome Sequence Informatics group has existed at OICR, our production
infrastructure and automation grew from a few small scripts to an unmanageable collection of
scripts in various languages, cron jobs, and servers. Each additional workflow brings along a new
collection of scripts that must orchestrate running that workflow. As we now have over 30
workflows, this creates additional load even though most jobs are doing similar work. Data that
fails to be picked up by the next workflow is difficult to track, and errors in scripts cause delays in
the whole pipeline. Because each system was developed independently, debugging and logging are
inconsistent, if available at all. Many of these pieces of software started as clones, but divergence
over time makes it hard to apply bug fixes consistently across them. We created Shesmu as a way to
consolidate and simplify orchestration: workflow scheduling, ticketing, data release, and QC
validation.

Shesmu ingests data from many user-specified tabular data sources and feeds it to olives, small SQL-
like programs that filter and group data, in order to produce actions. Actions communicate with
external systems to accomplish their task given the parameters provided by the olive. The standard
distribution includes plugins to integrate with several external systems, including Atlassian's JIRA,
remote servers via SSH, MongoDB, Prometheus, and GitHub as well as our internally developed
Niassa workflow engine, Pinery LIMS interface, and Guanyin reporting system. We have used
Shesmu to automate running Niassa and WDL workflows, generating reports, updating our QC data
warehouse, notifying operators about invalid data, requesting the lab enter missing required data,
and informing the lab of the current analysis progress.
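
The data flow described above (tabular input, filtering and grouping by an olive, one action per group) can be pictured with a short Python sketch. This is only an illustration of the idea and not Shesmu's olive language or API: the `Action` class, the `run_olive` function, the row fields (`project`, `library`, `path`, `qc_passed`), and the `launch_workflow` action name are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable


@dataclass(frozen=True)
class Action:
    """A hypothetical request to an external system (not Shesmu's real type)."""
    kind: str
    parameters: tuple  # (name, value) pairs; a tuple keeps the action hashable


def run_olive(
    rows: Iterable[dict],
    where: Callable[[dict], bool],
    group_by: Callable[[dict], Hashable],
    make_action: Callable[[Hashable, list], Action],
) -> set:
    """Filter rows, group the survivors, and build one action per group."""
    groups = defaultdict(list)
    for row in rows:
        if where(row):
            groups[group_by(row)].append(row)
    # A set means identical actions produced on repeated runs collapse into one.
    return {make_action(key, members) for key, members in groups.items()}


# Hypothetical tabular input, standing in for a user-specified data source.
rows = [
    {"project": "P1", "library": "LIB_A", "path": "/data/a_1.fastq.gz", "qc_passed": True},
    {"project": "P1", "library": "LIB_A", "path": "/data/a_2.fastq.gz", "qc_passed": True},
    {"project": "P1", "library": "LIB_B", "path": "/data/b_1.fastq.gz", "qc_passed": False},
]

# An "olive" in this sketch: keep QC-passed rows, group them per library,
# and request one workflow launch for each group.
actions = run_olive(
    rows,
    where=lambda r: r["qc_passed"],
    group_by=lambda r: (r["project"], r["library"]),
    make_action=lambda key, members: Action(
        kind="launch_workflow",
        parameters=(
            ("inputs", tuple(sorted(m["path"] for m in members))),
            ("library", key[1]),
        ),
    ),
)

for action in actions:
    print(action)
```

In this sketch the filter and grouping logic stays separate from the code that talks to external systems, which is the division of labour the abstract describes between olives and actions.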

Shesmu has cut the time needed to develop new automation from over two weeks to a few days. Additionally, Shesmu runs
faster, provides better feedback to developers, and allows easier control for operators. The reduced
development time for olives has also reduced the need for operators to run workflows manually; it
is faster and easier to fix an olive and redeploy it than it is to run the workflow manually.
Furthermore, while writing workflow launchers required experience with Java and the API of the
workflow engine, the simplified domain language for writing olives has increased the number of
developers by lowering the barrier to entry. Shesmu's memory requirements are consistent and
the system scales very well; we went from 10 olives in March 2019 to 102 olives in March 2020 with
very little change in resource usage.

Shesmu has replaced many of our launchers and data processing cron jobs with olives that run
faster and provide a better experience for both our operators and developers.

Speakers

Andre Masella

Sr. Software Developer, Ontario Institute for Cancer Research
I'm a programmer at Genome Sequence Informatics at the Ontario Institute for Cancer Research, supporting pipeline infrastructure and maintenance projects for our automated analysis.


Tuesday July 21, 2020 02:01 - 02:15 EDT
BOSC