Functional Risk Minimization
Ferran Alet, Clement Gehring, Tomás Lozano-Pérez, Kenji Kawaguchi, Joshua B. Tenenbaum, Leslie Pack Kaelbling
TL;DR
Functional Risk Minimization (FRM) reframes learning by modeling noise in function space rather than output space. It introduces Functional Generative Models (FGMs), in which each datum draws its own function $f_{\theta_i}$ from a shared prior $\mathcal{P}(\cdot|\theta^*)$. The resulting FRM objective projects the unknown function-space distribution onto a tractable family, yielding a cross-entropy-like criterion with a normalization term and enabling practical approximations via variational bounds and over-parameterization with Taylor/Laplace expansions. Empirical results across linear regression, value-function estimation, and representation learning show that FRM can outperform ERM, particularly when data exhibit structured, per-point variability. By explicitly optimizing in function space, FRM offers a pathway to understanding and improving generalization in modern, highly over-parameterized models trained on large, diverse datasets. The work also outlines scalable approaches (e.g., last-layer adaptations, variational FRM) for applying FRM to real-world models such as LLMs and VAEs, suggesting broad practical impact for modeling dataset diversity through function-space priors.
Abstract
The field of Machine Learning has changed significantly since the 1970s. However, its most basic principle, Empirical Risk Minimization (ERM), remains unchanged. We propose Functional Risk Minimization (FRM), a general framework where losses compare functions rather than outputs. This results in better performance in supervised, unsupervised, and RL experiments. In the FRM paradigm, for each data point $(x_i,y_i)$ there is a function $f_{\theta_i}$ that fits it: $y_i = f_{\theta_i}(x_i)$. This allows FRM to subsume ERM for many common loss functions and to capture more realistic noise processes. We also show that FRM provides an avenue towards understanding generalization in the modern over-parameterized regime, as its objective can be framed as finding the simplest model that fits the training data.
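To make the per-point function idea concrete, the following is a minimal sketch of a toy functional generative model. It is an illustration under stated assumptions, not the paper's experimental setup: the function class is assumed to be linear, $f_\theta(x) = wx + b$, the prior is assumed to be an independent Gaussian centered at $\theta^*$, and the helper name `sample_fgm_linear` is hypothetical. The sketch shows that perturbing only the bias reproduces ordinary additive output noise (the setting where squared-error ERM is the natural choice), while perturbing the slope produces output noise whose variance scales with $x^2$.

```python
import numpy as np


def sample_fgm_linear(n, theta_star, sigma, rng):
    """Sample n points from a toy functional generative model (hypothetical helper).

    Assumption: each point i draws theta_i = (w_i, b_i) from a Gaussian
    N(theta_star, diag(sigma**2)), and its label is produced exactly by its
    own function: y_i = f_{theta_i}(x_i) = w_i * x_i + b_i.
    """
    x = rng.uniform(-2.0, 2.0, size=n)
    theta = theta_star + sigma * rng.standard_normal((n, 2))  # per-point (w_i, b_i)
    y = theta[:, 0] * x + theta[:, 1]
    return x, y, theta


rng = np.random.default_rng(0)
theta_star = np.array([1.5, -0.5])  # shared (w*, b*), chosen arbitrarily
n = 100_000

# Case 1: function-space noise only in the bias.
# This is identical to additive output noise: y = f_{theta*}(x) + eps.
x_b, y_b, theta_b = sample_fgm_linear(n, theta_star, np.array([0.0, 0.3]), rng)
resid_bias = y_b - (theta_star[0] * x_b + theta_star[1])

# Case 2: function-space noise only in the slope.
# The residual is delta_i * x_i, so its variance grows with x**2.
x_s, y_s, theta_s = sample_fgm_linear(n, theta_star, np.array([0.3, 0.0]), rng)
resid_slope = y_s - (theta_star[0] * x_s + theta_star[1])

# Correlation between squared residual and x**2 in each case.
corr_bias = np.corrcoef(resid_bias**2, x_b**2)[0, 1]
corr_slope = np.corrcoef(resid_slope**2, x_s**2)[0, 1]

print(f"bias noise:  residual std = {resid_bias.std():.3f}, corr(resid^2, x^2) = {corr_bias:.3f}")
print(f"slope noise: residual var = {resid_slope.var():.3f}, corr(resid^2, x^2) = {corr_slope:.3f}")
```

In this sketch every point is fit exactly by its own $f_{\theta_i}$, so all of the variability lives in the parameters. The slope-noise case is one simple instance of structured, per-point variability that a single output-space noise level does not describe.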
