Fundamentals of Regression

Miguel A. Mendez

TL;DR

The chapter presents regression as learning a stochastic mapping from inputs to outputs, rooted in maximum likelihood and uncertainty quantification. It surveys parametric and non-parametric approaches, including linear bases, ANNs, kernel methods, and symbolic regression, and shows how these methods connect to physics via physics-informed cost functions and PINNs. It highlights bootstrapping, cross-validation, and robust loss functions as tools for generalization and resilience to outliers, and it outlines strategies to couple data-driven models with PDE-based physics (e.g., FEM, PINNs) to yield physics-consistent predictions. The work provides a framework for hybridizing machine learning with numerical methods for physics, enabling robust uncertainty quantification and improved physical fidelity across scientific computing tasks.

Abstract

This chapter opens with a review of classic tools for regression, a subset of machine learning that seeks to find relationships between variables. With the advent of scientific machine learning, this field has moved from a purely data-driven (statistical) formalism to a constrained or "physics-informed" formalism, which integrates physical knowledge and methods from traditional computational engineering. In the first part, we introduce the general concepts and the statistical flavor of regression versus other forms of curve fitting. We then move to an overview of traditional machine learning methods, their classification, and ways to link them to traditional computational science. Finally, we close with a note on methods to combine machine learning and numerical methods for physics.

Paper Structure

This paper contains 13 sections, 39 equations, 6 figures.

Figures (6)

  • Figure 1: General overview of the regression framework, pictorially illustrated for a scalar problem. Left: Two possible models fit the training data (black dots). A prediction is requested and a stochastic process has to be fitted to the data. This can be described as in \ref{eq1} with a deterministic model for the mean and a stochastic model for the local distribution.
  • Figure 2: Dataset for tutorial 1, to illustrate the usage of cross-validation
  • Figure 3: Tutorial illustrating the use of bootstrapping to estimate uncertainties. Top: prediction of the two models versus data. Bottom: distribution of in-sample MSE (left) and out-of-sample MSE (right). The more complex model is more prone to overfitting.
  • Figure 4: A simple example of feedforward, fully connected architecture with two hidden layers and a total of seven neurons.
  • Figure 5: Syntax tree representation of the function $2 x\sin(x)+\sin(x)+3$. This tree has a root '+' and a depth of two. The nodes are denoted with orange circles, while the last entries are leaves.
  • ...and 1 more figure
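
The bootstrapping comparison described in the caption of Figure 3 can be sketched as follows. This is a minimal illustration, not the chapter's tutorial code: the synthetic dataset, the two polynomial degrees (3 and 9), the number of resamples, and the use of out-of-bag points for the out-of-sample error are all assumptions made here for the example.

```python
import numpy as np


def bootstrap_mse(x, y, degree, n_boot=200, seed=0):
    """Bootstrap the in-sample and out-of-sample MSE of a polynomial fit.

    Each resample draws n points with replacement; the points that were
    not drawn (out-of-bag) serve as the out-of-sample set.
    """
    rng = np.random.default_rng(seed)
    n = x.size
    mse_in, mse_out = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # points left out of this resample
        if oob.size == 0:                       # extremely unlikely, but be safe
            continue
        coeffs = np.polyfit(x[idx], y[idx], degree)
        mse_in.append(np.mean((np.polyval(coeffs, x[idx]) - y[idx]) ** 2))
        mse_out.append(np.mean((np.polyval(coeffs, x[oob]) - y[oob]) ** 2))
    return np.array(mse_in), np.array(mse_out)


# Hypothetical dataset: a noisy sine wave (assumed for illustration only).
data_rng = np.random.default_rng(42)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(np.pi * x) + 0.2 * data_rng.standard_normal(x.size)

# Compare a simpler and a more complex model on the same resamples.
results = {}
for degree in (3, 9):
    results[degree] = bootstrap_mse(x, y, degree)
    mse_in, mse_out = results[degree]
    print(f"degree {degree}: "
          f"in-sample MSE = {mse_in.mean():.4f} +/- {mse_in.std():.4f}, "
          f"out-of-sample MSE = {mse_out.mean():.4f} +/- {mse_out.std():.4f}")
```

Because both models share the same seed, they are fitted on identical resamples, so the spread of the two MSE distributions can be compared directly. A large gap between the in-sample and out-of-sample distributions is the usual symptom of overfitting.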