BCC2020 has ended
Registration closed July 15.

BCC2020 is online, global, and affordable. The meeting and training are now done, and the CoFest is under way.

The 2020 Bioinformatics Community Conference brings together the Bioinformatics Open Source Conference (BOSC) and the Galaxy Community Conference into a single event featuring training, a meeting, and a CollaborationFest. Events run from July 17 through July 25 and are held in both the eastern and western hemispheres.

Monday, July 20 • 02:20 - 02:25
Secondary analysis of publicly available omics data across almost 3 million publications 🍐

The presenter(s) will be available for live Q&A in this session (BCC West).

Nicholas Darci-Maher 1, Kerui Peng 3, Dat Duong 1, Richard J. Abdill 2, Eleazar Eskin 1, Serghei Mangul 3

1 University of California, Los Angeles, California, USA. Email: niko.darcimaher@gmail.com
2 University of Minnesota, Minnesota, USA
3 University of Southern California, California, USA

Methods code: https://github.com/smangul1/data_reusability
License: MIT License

As high-throughput sequencing techniques become increasingly affordable and accurate,
the number of publicly available omics datasets is growing rapidly. Bioinformatics methods provide
unprecedented opportunities for analysis of omics datasets in quantitative biological research.
Traditionally, such research has included primary analysis of novel omics data generated as part of the
study. However, this data has the potential to be reused and is often valuable beyond the scope of the
study that introduced it. Data-driven research through secondary analysis of existing datasets is becoming
more important. Increased availability of public omics data represents an opportunity to find novel
insights and discoveries across different datasets.

This study presents a quantitative analysis of the reusability of omics datasets in two online
repositories, the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO). We
downloaded over 2.5 million publications from the PubMed Central Open Access corpus and identified
those that referenced SRA or GEO datasets. We used these papers to examine reusability based on various
factors, including journal, repository, sequencing technology, and species. We find that most datasets are
never reused: they are mentioned once, in the study that introduced them, and never
referenced again. In recent years, however, data reuse has been rising. We aim to shed light on the landscape of
data sharing in the quantitative biology research community and to illuminate the benefits of secondary
analysis of omics data.
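
The identification step described above can be illustrated with a minimal sketch. It is not the authors' pipeline (that lives in the methods repository linked above). It assumes that dataset references can be found by matching accession-style strings in paper text, and the regular expression, function names, and toy corpus below are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Assumed accession patterns (illustrative, not taken from the authors' code):
# GEO series look like GSE12345; SRA records look like SRP012345 or SRR1234567.
ACCESSION_RE = re.compile(r"\b(GSE\d+|SR[PRXS]\d+)\b")


def papers_per_dataset(papers):
    """Map each accession to the set of paper IDs whose text mentions it."""
    mentions = defaultdict(set)
    for paper_id, text in papers.items():
        for accession in ACCESSION_RE.findall(text):
            mentions[accession].add(paper_id)
    return mentions


def split_by_reuse(mentions):
    """Separate accessions mentioned by one paper from those mentioned by several."""
    single = sorted(a for a, ids in mentions.items() if len(ids) == 1)
    reused = sorted(a for a, ids in mentions.items() if len(ids) > 1)
    return single, reused


# Hypothetical toy corpus standing in for PubMed Central Open Access full text.
papers = {
    "PMC_A": "Reads were deposited in SRA under SRP000001 and in GEO as GSE100.",
    "PMC_B": "We reanalyzed GSE100 together with our own samples.",
    "PMC_C": "Data are available under GSE200.",
}

mentions = papers_per_dataset(papers)
single, reused = split_by_reuse(mentions)
print("mentioned by one paper:", single)
print("mentioned by several papers:", reused)
```

In this sketch a dataset counts as reused when more than one paper mentions its accession. A real analysis would also need to handle other accession prefixes, distinguish the introducing study from later ones, and break the counts down by journal, repository, technology, and species.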


Nicholas Darci-Maher

University of California, Los Angeles

Monday July 20, 2020 02:20 - 02:25 EDT