Semantic Data Lake for Biomedical Research

Project: Research project


Capitalizing on the transformative opportunities afforded by the extremely large and ever-growing volume, velocity, and variety of biomedical data being continuously produced is a major challenge. The development and increasingly widespread adoption of several new technologies, including next generation genetic sequencing, electronic health records and clinical trials systems, and research data warehouses means that we are in the midst of a veritable explosion in data production. This in turn results in the migration of the bottleneck in scientific productivity into data management and interpretation: tools are urgently needed to assist cancer researchers in the assembly, integration, transformation, and analysis of these Big Data sets. In this project, we propose to develop the Semantic Data Lake for Biomedical Research (SDL-BR) system, a cluster-computing software environment that enables rapid data ingestion, multifaceted data modeling, logical and semantic querying and data transformation, and intelligent resource discovery. SDL-BR is based on the idea of a data lake, a distributed store that does not make any assumptions about the structure of incoming data, and that delays modeling decisions until data is to be used. This project adds to the data lake paradigm methods for semantic data modeling, integration, and querying, and for resource discovery based on learned relationships between users and data resources.
Effective start/end date9/15/162/29/20


  • National Institutes of Health: $589,741.00
  • National Institutes of Health: $50,000.00
  • National Institutes of Health: $614,169.00
  • National Institutes of Health: $221,175.00


Data structures
Cluster computing
Data warehouses
Information management