Sustainable data and metadata management at the BD2K-LINCS Data coordination and integration center

Vasileios Stathias, Amar Koleti, Dusica Vidovic, Daniel J. Cooper, Kathleen M. Jagodnik, Raymond Terryn, Michele Forlin, Caty Chung, Denis Torre, Nagi Ayad, Mario Medvedovic, Avi Ma'Ayan, Ajay Pillai, Stephan C. Schuerer

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

The NIH-funded LINCS Consortium is creating an extensive reference library of cell-based perturbation response signatures and sophisticated informatics tools incorporating a large number of perturbagens, model systems, and assays. To date, more than 350 datasets have been generated including transcriptomics, proteomics, epigenomics, cell phenotype and competitive binding profiling assays. The large volume and variety of data necessitate rigorous data standards and effective data management including modular data processing pipelines and end-user interfaces to facilitate accurate and reliable data exchange, curation, validation, standardization, aggregation, integration, and end user access. Deep metadata annotations and the use of qualified data standards enable integration with many external resources. Here we describe the end-to-end data processing and management at the DCIC to generate a high-quality and persistent product. Our data management and stewardship solutions enable a functioning Consortium and make LINCS a valuable scientific resource that aligns with big data initiatives such as the BD2K NIH Program and concords with emerging data science best practices including the findable, accessible, interoperable, and reusable (FAIR) principles.

Original language: English (US)
Article number: 180117
Journal: Scientific data
Volume: 5
DOI: https://doi.org/10.1038/sdata.2018.117
State: Published - Jun 19 2018

Fingerprint

Metadata
Information management
Data Management
Assays
management
Electronic data interchange
Standardization
User interfaces
Agglomeration
Pipelines
Resources
Best Practice
Proteomics
Cell
Data Exchange
Profiling
Phenotype
User Interface
Annotation
Aggregation

ASJC Scopus subject areas

  • Statistics and Probability
  • Information Systems
  • Education
  • Computer Science Applications
  • Statistics, Probability and Uncertainty
  • Library and Information Sciences

Cite this

Sustainable data and metadata management at the BD2K-LINCS Data coordination and integration center. / Stathias, Vasileios; Koleti, Amar; Vidovic, Dusica; Cooper, Daniel J.; Jagodnik, Kathleen M.; Terryn, Raymond; Forlin, Michele; Chung, Caty; Torre, Denis; Ayad, Nagi; Medvedovic, Mario; Ma'Ayan, Avi; Pillai, Ajay; Schuerer, Stephan C.

In: Scientific data, Vol. 5, 180117, 19.06.2018.

Stathias, V, Koleti, A, Vidovic, D, Cooper, DJ, Jagodnik, KM, Terryn, R, Forlin, M, Chung, C, Torre, D, Ayad, N, Medvedovic, M, Ma'Ayan, A, Pillai, A & Schuerer, SC 2018, 'Sustainable data and metadata management at the BD2K-LINCS Data coordination and integration center', Scientific data, vol. 5, 180117. https://doi.org/10.1038/sdata.2018.117
@article{b276092b6d2a438ea416ac107612bfd1,
title = "Sustainable data and metadata management at the BD2K-LINCS Data coordination and integration center",
author = "Vasileios Stathias and Amar Koleti and Dusica Vidovic and Cooper, {Daniel J.} and Jagodnik, {Kathleen M.} and Raymond Terryn and Michele Forlin and Caty Chung and Denis Torre and Nagi Ayad and Mario Medvedovic and Avi Ma'Ayan and Ajay Pillai and Schuerer, {Stephan C}",
year = "2018",
month = "6",
day = "19",
doi = "10.1038/sdata.2018.117",
language = "English (US)",
volume = "5",
journal = "Scientific data",
issn = "2052-4463",
publisher = "Nature Publishing Group",

}

TY - JOUR

T1 - Sustainable data and metadata management at the BD2K-LINCS Data coordination and integration center

AU - Stathias, Vasileios

AU - Koleti, Amar

AU - Vidovic, Dusica

AU - Cooper, Daniel J.

AU - Jagodnik, Kathleen M.

AU - Terryn, Raymond

AU - Forlin, Michele

AU - Chung, Caty

AU - Torre, Denis

AU - Ayad, Nagi

AU - Medvedovic, Mario

AU - Ma'Ayan, Avi

AU - Pillai, Ajay

AU - Schuerer, Stephan C

PY - 2018/6/19

Y1 - 2018/6/19

UR - http://www.scopus.com/inward/record.url?scp=85048978010&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85048978010&partnerID=8YFLogxK

U2 - 10.1038/sdata.2018.117

DO - 10.1038/sdata.2018.117

M3 - Article

C2 - 29917015

AN - SCOPUS:85048978010

VL - 5

JO - Scientific data

JF - Scientific data

SN - 2052-4463

M1 - 180117

ER -