A Declarative Query Language for Scientific Machine Learning

Hasan M Jamil

TL;DR

This paper introduces MQL, a SQL-like declarative query language for machine learning that lets naïve users specify data wrangling, model construction, and ML analysis without writing low-level code. It formalizes three statements (GENERATE, CONSTRUCT, and INSPECT) with translational semantics that map to standard back-ends, illustrated via SciKit-Learn, and demonstrates practicality through two MatFlow-driven materials-science experiments: quantum dye design and lipid membrane bending modulus. It presents a translational mapping τ and a three-part translation pipeline, discusses dependencies among statements, and analyzes results from the initial experiments while acknowledging current limitations and future work (visualization, an optimizer, a PostgreSQL back-end, and LLM-based front-ends). Overall, MQL aims to lower the entry barrier to ML in science by providing a high-level declarative paradigm amenable to automated code synthesis and cross-framework back-ends; the early demonstrations suggest feasibility and potential impact in materials design. The paper notes that improved optimization, richer visualization, and broader back-end support are still needed to realize full declarativity and performance advantages.

Abstract

The popularity of data science as a discipline and its importance in the emerging economy and industrial progress dictate that machine learning be democratized for the masses. This also means that the current practice of workforce training using machine learning tools, which requires low-level statistical and algorithmic details, is a barrier that needs to be addressed. Similar to data management languages such as SQL, machine learning needs to be practiced at a conceptual level to help make it a staple tool for general users. In particular, the technical sophistication demanded by existing machine learning frameworks is prohibitive for many scientists who are not computationally savvy or well versed in machine learning techniques. The learning curve to use the needed machine learning tools is also too high for them to take advantage of these powerful platforms to rapidly advance science. In this paper, we introduce a new declarative machine learning query language, called MQL, for naive users. We discuss its merit and possible ways of implementing it over a traditional relational database system. We discuss two materials science experiments implemented using MQL on a materials science workflow system called MatFlow.

Paper Structure (18 sections, 6 figures, 4 algorithms)

Figures (6)

  • Figure 1: MQL query for median home value prediction.
  • Figure 2: SciKit-Learn Python code for the MQL query in Figure 1.
  • Figure 3: Test data input table homesNew.
  • Figure 4: Predicted home median values.
  • Figure 5: MQL operational model.
  • ...and 1 more figure