Evolving Scientific Discovery by Unifying Data and Background Knowledge with AI Hilbert

Ryan Cory-Wright; Cristina Cornelio; Sanjeeb Dash; Bachir El Khadir; Lior Horesh

Evolving Scientific Discovery by Unifying Data and Background Knowledge with AI Hilbert

Ryan Cory-Wright, Cristina Cornelio, Sanjeeb Dash, Bachir El Khadir, Lior Horesh

TL;DR

AI-Hilbert offers a principled framework that unifies experimental data and background knowledge expressed as polynomial equalities and inequalities, solving a polynomial optimization problem to discover laws $q(m{x})=0$ that are simultaneously data-consistent and derivable from theory. It leverages sum-of-squares optimization and the Positivstellensatz to obtain certificates of correctness, enabling not only discovery but also formal proofs of consistency with background axioms. The approach demonstrates rediscovery of Kepler's Third Law, Einstein time dilation, Hagen-Poiseuille flow, gravitational wave power, and Bell inequalities under various combinations of complete, incomplete, and inconsistent axioms, and highlights a data-efficiency advantage when relevant theory is supplied. It also defines a distance measure $d^c(q, G\u2229 H)$ to navigate incomplete or conflicting background knowledge and discusses practical considerations for data-to-theory trade-offs and computational scalability. The results suggest AI-Hilbert as a promising step toward reliable, interpretable scientific discovery that can operate in regimes where data are scarce or theory is partial, with clear avenues for extending to nonpolynomial settings and automated hyperparameter tuning.

Abstract

The discovery of scientific formulae that parsimoniously explain natural phenomena and align with existing background theory is a key goal in science. Historically, scientists have derived natural laws by manipulating equations based on existing knowledge, forming new equations, and verifying them experimentally. In recent years, data-driven scientific discovery has emerged as a viable competitor in settings with large amounts of experimental data. Unfortunately, data-driven methods often fail to discover valid laws when data is noisy or scarce. Accordingly, recent works combine regression and reasoning to eliminate formulae inconsistent with background theory. However, the problem of searching over the space of formulae consistent with background theory to find one that best fits the data is not well-solved. We propose a solution to this problem when all axioms and scientific laws are expressible via polynomial equalities and inequalities and argue that our approach is widely applicable. We model notions of minimal complexity using binary variables and logical constraints, solve polynomial optimization problems via mixed-integer linear or semidefinite optimization, and prove the validity of our scientific discoveries in a principled manner using Positivstellensatz certificates. The optimization techniques leveraged in this paper allow our approach to run in polynomial time with fully correct background theory under an assumption that the complexity of our derivation is bounded), or non-deterministic polynomial (NP) time with partially correct background theory. We demonstrate that some famous scientific laws, including Kepler's Third Law of Planetary Motion, the Hagen-Poiseuille Equation, and the Radiated Gravitational Wave Power equation, can be derived in a principled manner from axioms and experimental data.

Evolving Scientific Discovery by Unifying Data and Background Knowledge with AI Hilbert

TL;DR

that are simultaneously data-consistent and derivable from theory. It leverages sum-of-squares optimization and the Positivstellensatz to obtain certificates of correctness, enabling not only discovery but also formal proofs of consistency with background axioms. The approach demonstrates rediscovery of Kepler's Third Law, Einstein time dilation, Hagen-Poiseuille flow, gravitational wave power, and Bell inequalities under various combinations of complete, incomplete, and inconsistent axioms, and highlights a data-efficiency advantage when relevant theory is supplied. It also defines a distance measure

to navigate incomplete or conflicting background knowledge and discusses practical considerations for data-to-theory trade-offs and computational scalability. The results suggest AI-Hilbert as a promising step toward reliable, interpretable scientific discovery that can operate in regimes where data are scarce or theory is partial, with clear avenues for extending to nonpolynomial settings and automated hyperparameter tuning.

Abstract

Paper Structure (40 sections, 1 theorem, 68 equations, 5 figures, 1 table, 1 algorithm)

This paper contains 40 sections, 1 theorem, 68 equations, 5 figures, 1 table, 1 algorithm.

Introduction
Literature Review
Sum-of-Squares Optimization:
Data-Driven Approaches to Scientific Discovery:
Contributions and Structure
Background and Notation
Structure
Discovering Scientific Formulae Via Polynomial Optimization
Method Overview
Overall Problem Setting
Implicit and Explicit Symbolic Discovery:
Complexity of Scientific Discovery:
Distance to Background Theory and Model Complexity
Incomplete Case:
Inconsistent Case:
...and 25 more sections

Key Result

Theorem 1

Consider the basic (semi)algebraic sets where $g_i, h_j \in \mathbb{R}[x]_n$, and $\mathcal{G}$ satisfies the Archimedean propertyThis assumption is stronger than the compactness assumption on $\mathcal{G}$ found in the Positivstellensatz of schmudgen1991moment, but is typically not restrictive in practice, as one could assume that $g_1(x holds if and only if there exist SOS polynomials $\alpha_0

Figures (5)

Figure 1: Comparison of scientific discovery paradigms. Traditional scientific discovery formulates hypotheses using existing theory and observed phenomena. These hypotheses are validated and tested using data. In contrast, machine learning methods rely on large datasets to identify patterns. AI-Descartes cornelio2021ai proposes an inversion of the conventional scientific discovery paradigm. It generates hypotheses from observed data and validates them against known theory. However, in AI-Descartes, theory and data remain disjoint and do not mutually enhance one another. In contrast, our work, AI Hilbert, combines data and theory to formulate hypotheses. Unlike conventional methods, insights in data and knowledge embedded in the theory collaboratively reduce the search space. These two components complement each other: theory compensates for noisy or sparse data, while data compensates for inconsistent or incomplete theory. Note that blue denotes components associated with data, purple denotes components linked to theory, and dashed lines represent iterative processes.
Figure 2: The scientific method with scientific discoveries made via classical methods, data-driven methods, or AI-Hilbert. AI-Hilbert proposes scientific laws consistent with a body of background theory formally articulated as polynomial equalities, inequalities, and relevant data sources. This likely allows scientific discoveries to be made using fewer data points than state-of-the-art approaches, and for missing scientific axioms to be deduced via abductive reasoning as part of the scientific discovery process. On the other hand, existing approaches to scientific discovery propose laws that may be inconsistent with either background theory or existing data sources.
Figure 3: Schematic illustration of AI Hilbert and its components. Using background knowledge encoded as multivariate polynomials, experimental data, and hyperparameters (e.g., a sparsity constraint on the background theory) to control our model's complexity, we formulate scientific discovery as a polynomial optimization problem, reformulate it as a semidefinite optimization problem, and solve it to obtain both a symbolic model and its formal derivation. Dashed boxes correspond to optional components. An example is introducing an incorrect candidate formula as a new axiom in the background theory.
Figure 4: Illustration of the Positivstellensatz and its ability to recover the Radiation Gravitational Wave Power Equation in the special case where $m_1=m_2=c=G=1$. Keeping other variables constant, the points that obey the power equation are the intersection of the points that obey Kepler's Third Law and the points of a linearized equation from general relativity, and the wave equation is recoverable by adding these other equations with appropriate polynomial multipliers.
Figure 5: Coefficient distance between scientific formula derived by AI-Hilbert and ground truth vs. the number of data points when we omit some axioms (left) or all axioms (right).

Theorems & Definitions (4)

Theorem 1: Putinar's Positivstellensatz putinar1993positive, see also Theorem 5.1 of parrilo2003semidefinite
Example 1
Example 2
Example 3

Evolving Scientific Discovery by Unifying Data and Background Knowledge with AI Hilbert

TL;DR

Abstract

Evolving Scientific Discovery by Unifying Data and Background Knowledge with AI Hilbert

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (5)

Theorems & Definitions (4)