Multifidelity linear regression for scientific machine learning from scarce data

Elizabeth Qian, Dayoung Kang, Vignesh Sella, Anirban Chaudhuri

TL;DR

This work addresses learning surrogate models when high-fidelity data are scarce by introducing multifidelity control variate estimators for linear regression. It formulates a multifidelity Monte Carlo (MFMC) framework that combines high- and low-fidelity data to obtain unbiased, reduced-variance estimates of the regression cross-covariance $\hat{c}_{XY}$ and of the resulting regression coefficients $\hat{\beta}$. The authors provide bias and variance analysis, optimality results for the control variate coefficients, and practical guidance on hyperparameter choices and sample allocation. Numerical experiments on an analytic exponential function and a PDE-based convection-diffusion-reaction problem show that the multifidelity approach achieves accuracy similar to standard high-fidelity-only training with orders-of-magnitude fewer high-fidelity samples.

Abstract

Machine learning (ML) methods, which fit to data the parameters of a given parameterized model class, have garnered significant interest as potential methods for learning surrogate models for complex engineering systems for which traditional simulation is expensive. However, in many scientific and engineering settings, generating high-fidelity data on which to train ML models is expensive, and the available budget for generating training data is limited, so that high-fidelity training data are scarce. ML models trained on scarce data have high variance, resulting in poor expected generalization performance. We propose a new multifidelity training approach for scientific machine learning via linear regression that exploits the scientific context where data of varying fidelities and costs are available: for example, high-fidelity data may be generated by an expensive fully resolved physics simulation whereas lower-fidelity data may arise from a cheaper model based on simplifying assumptions. We use the multifidelity data within an approximate control variate framework to define new multifidelity Monte Carlo estimators for linear regression models. We provide bias and variance analysis of our new estimators that guarantees the approach's accuracy and improved robustness to scarce high-fidelity data. Numerical results demonstrate that our multifidelity training approach achieves similar accuracy to the standard high-fidelity-only approach with orders-of-magnitude reduced high-fidelity data requirements.
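The control variate idea in the abstract can be illustrated with a minimal two-fidelity sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the stand-in models `f_hi` and `f_lo`, the features $x = [1, z]$ with $z \sim \mathrm{Uniform}(0,1)$, the fixed scalar control variate coefficient `alpha = 1` (the paper derives optimal coefficients instead), and the use of the exactly known second-moment matrix $C_{XX} = E[x x^\top]$. The sketch estimates $c_{XY} = E[x\,y]$ from a few high-fidelity samples, corrects it with many low-fidelity samples, and solves $\beta = C_{XX}^{-1} c_{XY}$.

```python
import math
import random


def features(z):
    # Linear model with intercept: x = [1, z] (an illustrative choice).
    return [1.0, z]


def f_hi(z):
    # Hypothetical stand-in for an expensive high-fidelity model.
    return math.exp(z)


def f_lo(z):
    # Hypothetical cheap low-fidelity model: truncated Taylor series of exp.
    return 1.0 + z + 0.5 * z * z


def mean_xy(zs, model):
    # Monte Carlo estimate of c_XY = E[x * y] over the inputs zs.
    sums = [0.0, 0.0]
    for z in zs:
        y = model(z)
        for j, xj in enumerate(features(z)):
            sums[j] += xj * y
    return [s / len(zs) for s in sums]


def c_xy_estimates(rng, n_hi, n_lo, alpha=1.0):
    # Draw n_lo inputs; only the first n_hi are evaluated with f_hi.
    zs = [rng.random() for _ in range(n_lo)]
    c_hf = mean_xy(zs[:n_hi], f_hi)    # high-fidelity-only estimate
    lo_all = mean_xy(zs, f_lo)         # low-fidelity, all n_lo inputs
    lo_sub = mean_xy(zs[:n_hi], f_lo)  # low-fidelity, shared n_hi inputs
    # Control variate correction: the added term has zero mean, so the
    # estimator stays unbiased for any fixed alpha.
    c_mf = [h + alpha * (a - s) for h, a, s in zip(c_hf, lo_all, lo_sub)]
    return c_hf, c_mf


def solve_beta(c_xy):
    # beta = C_XX^{-1} c_XY with C_XX = [[1, 1/2], [1/2, 1/3]] for
    # z ~ Uniform(0, 1); its exact inverse is [[4, -6], [-6, 12]].
    return [4.0 * c_xy[0] - 6.0 * c_xy[1], -6.0 * c_xy[0] + 12.0 * c_xy[1]]


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def compare(n_rep=200, n_hi=10, n_lo=1000, seed=0):
    # Repeat training on independent data and compare the slope estimates.
    rng = random.Random(seed)
    hf, mf = [], []
    for _ in range(n_rep):
        c_hf, c_mf = c_xy_estimates(rng, n_hi, n_lo)
        hf.append(solve_beta(c_hf)[1])
        mf.append(solve_beta(c_mf)[1])
    return variance(hf), variance(mf), sum(mf) / n_rep
```

Calling `compare()` returns the sample variance of the slope coefficient under high-fidelity-only and multifidelity training, plus the multifidelity mean. In this toy setting the low-fidelity model is strongly correlated with the high-fidelity one, so the multifidelity variance should be much smaller while the mean stays near the exact slope $18 - 6e$.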

Paper Structure

This paper contains 18 sections, 6 theorems, 39 equations, 5 figures, 3 tables.

Key Result

Theorem 3.1

Unbiasedness of multifidelity linear regression approach:

Figures (5)

  • Figure 1: Exponential function example: 500 realizations (semi-transparent colored lines) of estimators for the first entry of $\hat{c}_{XY}$ (left), the first regression coefficient (center), and the regression model prediction $\hat{f}(z=5;\hat{\beta})$ (right), when true model statistics are known. Black lines denote true values and lines with markers show the mean over the 500 realizations of training data.
  • Figure 2: Exponential example: comparing models learned with the standard high-fidelity-only (HF) and proposed multifidelity (MF) training approaches.
  • Figure 3: Analytical example: convergence of multifidelity linear regression estimators for $\hat{c}_{XY}$ (top), $\hat{\beta}$ (middle), and $\hat{f}(z;\hat{\beta})$ (bottom) when model statistics are exact (left), estimated using 100 pilot samples (center), or 10 pilot samples (right).
  • Figure 4: PDE model problem: convergence of multifidelity linear regression estimators for $\hat{c}_{XY}$ (top), $\hat{\beta}$ (middle), and $\hat{f}(z;\hat{\beta})$ (bottom), when model statistics are estimated using $10^5$ (left), 100 (center), or 10 (right) pilot samples. Results for the multifidelity approach with the optimal matrix coefficient are omitted in the second and third columns because their variances are so large that they would significantly distort the plot axes.
  • Figure 5: Generalization error over 1000 unseen test data points. Plotted lines and shaded regions are the mean and one standard deviation over 500 learned models trained on independent realizations of training data.

Theorems & Definitions (12)

  • Theorem 3.1
  • proof
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • Theorem 3.4
  • proof
  • Lemma 3.5
  • proof
  • ...and 2 more