An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum

TL;DR

This work introduces an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool, formulated as a convex optimization problem, making it scalable to large models and datasets.

Abstract

The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
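The matching condition described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a model that is linear in its parameters, Gaussian noise, and a single QoI. All names (`weighted_fim`, `target_fim`, `matches_information`, `propagated_qoi_cov`) and the numbers are hypothetical. It only checks whether a hand-picked set of data weights carries at least as much Fisher information as the QoI target requires. In the full method, the weights would be chosen by the convex optimization rather than fixed by hand.

```python
import numpy as np

def weighted_fim(jac_rows, weights):
    """Training-data FIM, sum_m w_m * a_m a_m^T, for a model linear in theta.

    jac_rows[m] is the Jacobian row a_m of candidate datum m, and
    weights[m] = 1 / sigma_m**2 is its assumed precision (0 means "not selected").
    """
    return jac_rows.T @ (weights[:, None] * jac_rows)

def target_fim(qoi_jac, qoi_cov):
    """FIM the parameters must carry to reach the target QoI covariance."""
    return qoi_jac.T @ np.linalg.inv(qoi_cov) @ qoi_jac

def matches_information(fim_data, fim_target, tol=1e-9):
    """True if fim_data - fim_target is positive semidefinite (up to tol)."""
    return bool(np.linalg.eigvalsh(fim_data - fim_target).min() >= -tol)

def propagated_qoi_cov(qoi_jac, fim_data):
    """Linearized QoI covariance; the pseudo-inverse skips directions the
    data leave unconstrained (the unidentifiable parameter combinations)."""
    fim_pinv = np.linalg.pinv(fim_data, rcond=1e-10, hermitian=True)
    return qoi_jac @ fim_pinv @ qoi_jac.T

# Hypothetical toy setup: three parameters, four candidate data points.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
c = np.array([[1.0, 1.0, 0.0]])   # the QoI depends only on theta_1 + theta_2
Sigma = np.array([[0.01]])        # target QoI variance (standard deviation 0.1)

J = target_fim(c, Sigma)

w_good = np.array([100.0, 0.0, 0.0, 0.0])  # measure only datum 0
w_bad = np.array([0.0, 100.0, 0.0, 0.0])   # measure only datum 1

ok_good = matches_information(weighted_fim(A, w_good), J)
ok_bad = matches_information(weighted_fim(A, w_bad), J)
var_good = propagated_qoi_cov(c, weighted_fim(A, w_good))

print("datum 0 alone satisfies the criterion:", ok_good)
print("datum 1 alone satisfies the criterion:", ok_bad)
print("propagated QoI variance with datum 0:", var_good[0, 0])
```

In this toy case, a single datum that measures the same parameter combination as the QoI is enough. The third parameter stays completely unconstrained without harming the prediction, which mirrors the abstract's point that only the parameter combinations relevant to the QoIs need to be learned.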

Paper Structure

This paper contains 5 sections, 1 theorem, 5 equations, 4 figures, 1 algorithm.

Key Result

Theorem 1

Let $\bm{g}(\pmb{\theta}; \mathbf{y})$ denote a mapping from the model parameters $\pmb{\theta}$ to the QoIs for input $\mathbf{y}$ that is analytic at $\pmb{\theta}_0 = \left< \pmb{\theta} \right>_{\pmb{\theta}}$, where $\left< \cdot \right>_{\pmb{\theta}}$ denotes an expectation value over the distribution [...] where $\mathbf{\Sigma}$ is the target covariance of the QoIs.

Figures (4)

  • Figure 1: Relationship between training data (left), model parameters (middle), and QoIs (right) in the information-matching framework. One first selects the target precision for the QoIs (blue envelope in the right panel). This QoI precision induces a minimal confidence region in parameter space (blue ellipse in the middle panel). The information-matching criterion selects training data and target precision (orange envelope in the left panel) such that the resulting parameter uncertainty (orange ellipse in the middle panel) is more restrictive than that induced by the QoIs. Propagating the parameter uncertainty to the QoIs gives predictions that are at least as precise as the original target (orange envelope in the right panel). This relationship holds even if the target uncertainties were divergent for certain QoIs (dashed blue curves in the right panel), resulting in the target parameter confidence diverging for some parameter combinations (dashed blue ellipse in the middle panel, extending in some directions).
  • Figure 2: The IEEE 39-bus system. Buses are represented by thick black lines, while transmission lines and transformers are shown as thin lines connecting them. Loads are depicted as black arrows pointing outward from the buses, and the circled labels G1 through G10 indicate the generators. Buses highlighted in orange denote the optimal PMU placements for full observability of the entire network. Buses highlighted in other colors (red, green, and blue) represent the optimal PMU placements for partial observability in the corresponding area. Many buses are double-highlighted with orange and another color, showing overlaps between full and partial observability. Non-overlapping optimal buses result from unobserved branches.
  • Figure 3: Source localization in a shallow ocean. Optimal receiver locations for localizing two sound sources (red speakers) in a shallow ocean with a sandy seabed using transmission loss data at 200 Hz. Small dots indicate candidate sites; large dots are the optimal receiver locations.
  • Figure 4: Uncertainty in the energy ($E$) of monolayer MoS$_2$ versus in-plane lattice parameter ($a$). Predictions are shifted by the energy $E_c$ at the equilibrium lattice constant $a_0$, aligning the minimum with the origin. The blue envelope is the target uncertainty (10% of the values predicted by the potential trained on the full dataset). In contrast, the red envelope shows the uncertainty propagated from the seven optimal training atomic configurations. Notice that the optimal propagated uncertainty is smaller than the target uncertainty.

Theorems & Definitions (1)

  • Theorem 1