Functional Risk Minimization

Ferran Alet, Clement Gehring, Tomás Lozano-Pérez, Kenji Kawaguchi, Joshua B. Tenenbaum, Leslie Pack Kaelbling

TL;DR

FRM reframes learning by modeling noise in function space rather than output space, introducing Functional Generative Models (FGMs) where each datum draws its own function $f_{\theta_i}$ from a shared prior $\mathcal{P}(\cdot|\theta^*)$. The resulting Functional Risk Minimization (FRM) objective projects the unknown function-space distribution onto a tractable family, yielding a cross-entropy-like criterion with a normalization term and enabling practical approximations via variational bounds and over-parameterization with Taylor/Laplace expansions. Empirical results across linear regression, value-function estimation, and representation learning demonstrate that FRM can outperform ERM, particularly when data exhibit structured, per-point variability. By explicitly optimizing in function space, FRM offers a pathway to understand and improve generalization in modern, highly over-parameterized models and large, diverse datasets. The work also outlines scalable approaches (e.g., last-layer adaptations, variational FRM) to apply FRM to real-world models like LLMs and VAEs, suggesting a broad practical impact for modeling dataset diversity through function-space priors.

Abstract

The field of Machine Learning has changed significantly since the 1970s. However, its most basic principle, Empirical Risk Minimization (ERM), remains unchanged. We propose Functional Risk Minimization (FRM), a general framework where losses compare functions rather than outputs. This results in better performance in supervised, unsupervised, and RL experiments. In the FRM paradigm, for each data point $(x_i,y_i)$ there is a function $f_{\theta_i}$ that fits it: $y_i = f_{\theta_i}(x_i)$. This allows FRM to subsume ERM for many common loss functions and to capture more realistic noise processes. We also show that FRM provides an avenue towards understanding generalization in the modern over-parameterized regime, as its objective can be framed as finding the simplest model that fits the training data.
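To make the per-datum function idea concrete, the following is a minimal sketch of a toy functional generative model for a linear function class. It is an illustration under stated assumptions, not the authors' code: the Gaussian prior over parameters, the names `sample_fgm_linear` and `residual_variance`, and all numeric values are hypothetical. Each point draws its own $\theta_i$ around a shared $\theta^*$, and its label is the noiseless output $y_i = f_{\theta_i}(x_i)$.

```python
import random
from statistics import pvariance


def sample_fgm_linear(n, theta_star=(1.0, 0.5), scale=(0.3, 0.1), seed=0):
    """Sample n points from a toy functional generative model (assumed form).

    Every datum draws its own (slope_i, intercept_i) from a Gaussian centred
    on theta_star (the Gaussian is an assumption made for illustration), and
    the label is exactly y_i = slope_i * x_i + intercept_i, with no extra
    output noise.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        slope = rng.gauss(theta_star[0], scale[0])
        intercept = rng.gauss(theta_star[1], scale[1])
        data.append((x, slope * x + intercept, (slope, intercept)))
    return data


def residual_variance(data, theta_star, keep):
    """Variance of y - f_{theta*}(x) over the points whose x satisfies keep(x)."""
    residuals = [
        y - (theta_star[0] * x + theta_star[1]) for x, y, _ in data if keep(x)
    ]
    return pvariance(residuals)


theta_star = (1.0, 0.5)
data = sample_fgm_linear(5000, theta_star)
inner = residual_variance(data, theta_star, lambda x: abs(x) < 0.5)
outer = residual_variance(data, theta_star, lambda x: abs(x) >= 0.5)
print(f"residual variance |x|<0.5: {inner:.4f}, |x|>=0.5: {outer:.4f}")
```

In this toy setting the spread of the residuals around $f_{\theta^*}$ grows with $|x|$, because slope variation is multiplied by the input. A fixed-variance output-noise model cannot express that structure, which is the kind of per-point variability the abstract refers to.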

Paper Structure

This paper contains 37 sections, 5 theorems, 17 equations, 9 figures, 1 table.

Key Result

Theorem 1

Let $l\ge 4$, $\mathcal{X}=[0,1]^t$, $\mathcal{Y}=[0,1]^m$, and let $\mathcal{F}_{\Theta}^{k,l}$ be the set of all functions represented by $l$-layer neural networks with sigmoidal activation and $k$ neurons per hidden layer. Let $q$ be a probability measure on $(\mathcal{X} \times \mathcal{Y}, \dots)$ ...

Figures (9)

  • Figure 1: Modeling functional noise helps capture structured variations in diverse datasets.
  • Figure 2: For many common losses, ERM and FRM can be related to maximum likelihood under simple generative models. Red lines ending in a circle are stochastic, blue arrows are deterministic. ERM often models noise in output space, $y_i {\color{red} \sim} p(\cdot | y_i^*)$, and FRM explains it in function space, $\theta_i {\color{red} \sim} p(\theta)$.
  • Figure 3: Functional generative models for a linear function class. We can plot the function space in 2D on the bottom-left of each sub-figure, with the actual data plotted on the top-right.
  • Figure 4: ERM with common losses is equivalent to maximum likelihood under an FGM that is only stochastic in the output parameters. The particular distribution depends on the loss: a) MSE with a Gaussian b) L1 with a Laplace c) cross-entropy with a Gumbel d) accuracy with a delta plus flat distribution. In practice, the axis for "other parameters" will often refer to thousands of parameters.
  • Figure 5: Finding the projection of the unknown distribution $\color{cyan}\mathcal{P}(\theta)$ to the family $\mathcal{Q}_{\theta^*}(\theta)$ of probability distributions in function space. Here $\color{green} Q_{\theta_3^*}$ is best.
  • ...and 4 more figures
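
The MSE case described in the captions of Figures 2 and 4 can be sketched as a short derivation. This is a standard argument written under assumptions that are not taken from the paper: the model is split as $f_\theta(x) = g_\phi(x) + b$ with an output bias $b$, and $\sigma$ is a fixed noise scale. If only the bias is stochastic, $b_i \sim \mathcal{N}(b^*, \sigma^2)$, then each datum has its own function $f_{\theta_i}$ that fits it exactly, and maximizing the likelihood recovers least squares:

```latex
\begin{aligned}
y_i &= g_{\phi^*}(x_i) + b_i, \qquad b_i \sim \mathcal{N}(b^*, \sigma^2)
  \;\Longrightarrow\; y_i \sim \mathcal{N}\!\left(f_{\theta^*}(x_i), \sigma^2\right), \\
-\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta^*)
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - f_{\theta^*}(x_i)\right)^2
     + \frac{n}{2} \log\!\left(2\pi\sigma^2\right), \\
\arg\max_{\theta^*} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta^*)
  &= \arg\min_{\theta^*} \sum_{i=1}^{n} \left(y_i - f_{\theta^*}(x_i)\right)^2 .
\end{aligned}
```

The last line holds because the second term of the negative log-likelihood does not depend on $\theta^*$ when $\sigma$ is fixed.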

Theorems & Definitions (5)

  • Theorem 1: Universal Distribution Theorem
  • Theorem 2: Universal Distribution Theorem
  • Lemma 1
  • Lemma 2
  • Lemma 3