EmpireDB: Data System to Accelerate Computational Sciences

Daniel Alabi, Eugene Wu

TL;DR

The paper addresses the challenge of extracting value from large, domain-rich scientific data by integrating domain knowledge and approximation controls into a cohesive data management system. It proposes EmpireDB, a three-component architecture (query engine, execution pipelines, storage engines) with an SQL-like language extended for approximation and constraints, designed to support training, inference, and active learning. The authors illustrate the approach with the GNoME materials-discovery use case, detailing structural and compositional pipelines and DFT integration, and present preliminary evidence that EmpireDB improves model training (e.g., better accuracy) relative to non-adaptive baselines. They argue that this framework could accelerate discovery across fields and outline future work on scaling, guarantees, and broader applicability.

Abstract

The emerging discipline of Computational Science is concerned with using computers to simulate or solve scientific problems. These problems span the natural, political, and social sciences. The discipline has exploded over the past decade due to the emergence of larger amounts of observational data and large-scale simulations that were previously unavailable or infeasible. However, there are still significant challenges in managing these large amounts of data and simulations. The database management systems community has always been at the forefront of developing the theory and practice of techniques for formalizing and actualizing systems that access or query large datasets. In this paper, we present EmpireDB, a vision for a data management system to accelerate computational sciences. In addition, we identify challenges and opportunities for the database community to further the fledgling field of computational sciences. Finally, we present preliminary evidence showing that the optimized components in EmpireDB could lead to improvements in performance compared to contemporary implementations.

Paper Structure

This paper contains 15 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: An illustration of system components in EmpireDB.
  • Figure 2: Illustration of the systems architecture of GNoME-based materials discovery [MBSACC23].