The presenter(s) will be available for live Q&A in this session (BCC West).
Nicholas Darci-Maher 1, Kerui Peng 3, Dat Duong 1, Richard J. Abdill 2, Eleazar Eskin 1, Serghei Mangul 3
1 University of California, Los Angeles, California, USA. Email: niko.darcimaher@gmail.com
2 University of Minnesota, Minnesota, USA
3 University of Southern California, California, USA
Methods code: https://github.com/smangul1/data_reusability
License: MIT License
Abstract
As high-throughput sequencing techniques become increasingly affordable and accurate,
the number of publicly available omics datasets is growing rapidly. Bioinformatics methods provide
unprecedented opportunities for analyzing omics datasets in quantitative biological research.
Traditionally, such research has centered on primary analysis of novel omics data generated as part of the
study. However, these data can be reused and are often valuable beyond the scope of the
study that introduced them. Data-driven research through secondary analysis of existing datasets is becoming
more important. The increased availability of public omics data represents an opportunity to find novel
insights and discoveries across different datasets.
This study presents a quantitative analysis of the reusability of omics datasets in two online
repositories, the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO). We
downloaded over 2.5 million publications from the PubMed Central Open Access corpus and identified
those that referenced SRA or GEO datasets. We used these papers to examine reusability based on various
factors, including journal, repository, sequencing technology, and species. We find that most datasets are
never reused: they are mentioned once, in the study that introduced them, and never
referenced again. In recent years, however, data reuse has been rising. We aim to shed light on the landscape of
data sharing in the quantitative biology research community and to illuminate the benefits of secondary
analysis of omics data.
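
The idea of identifying publications that reference SRA or GEO datasets, and then separating datasets mentioned once from those mentioned again, can be illustrated with a minimal sketch. This is not the authors' pipeline (see the linked repository for that). The regular expression, the function names, and the toy corpus with its PMC identifiers are all assumptions made for illustration, and the pattern covers only a subset of SRA-style and GEO-style accession prefixes.

```python
import re
from collections import defaultdict

# Assumed accession patterns, for illustration only (not from the authors' code):
# SRA-style study/run/experiment/sample IDs and GEO series/sample IDs.
ACCESSION_RE = re.compile(r"\b(?:[SED]R[PRXS]\d{6,}|GSE\d+|GSM\d+)\b")


def find_accessions(text):
    """Return the set of SRA/GEO-style accessions mentioned in a text."""
    return set(ACCESSION_RE.findall(text))


def count_mentions(papers):
    """Map each accession to the set of paper IDs that mention it."""
    mentions = defaultdict(set)
    for paper_id, text in papers.items():
        for accession in find_accessions(text):
            mentions[accession].add(paper_id)
    return mentions


def reuse_summary(mentions):
    """Split accessions into those mentioned by one paper and by several."""
    single = sorted(a for a, ids in mentions.items() if len(ids) == 1)
    reused = sorted(a for a, ids in mentions.items() if len(ids) > 1)
    return single, reused


# Hypothetical toy corpus; the IDs and sentences are made up.
papers = {
    "PMC0000001": "Reads were deposited in SRA under SRP012345.",
    "PMC0000002": "We reanalyzed SRP012345 together with GSE54321.",
    "PMC0000003": "Expression data are available as GSE99999.",
}

mentions = count_mentions(papers)
single, reused = reuse_summary(mentions)
print("mentioned once:", single)
print("mentioned by more than one paper:", reused)
```

Counting distinct papers per accession is a deliberately crude proxy for reuse: it cannot tell which paper introduced a dataset, and it would miss datasets cited only through a publication reference.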