Principled Out-of-Distribution Generalization via Simplicity
Jiawei Ge, Amanda Wang, Shange Tang, Chi Jin
TL;DR
The paper studies why modern foundation models generalize out of distribution and explains the phenomenon through a simplicity principle. It formalizes a simplicity metric $R(\beta)$, defines the ground-truth model $\beta^{\star}$ as the simplest minimizer of the training (source) risk, and analyzes a regularized maximum likelihood estimator under covariate shift in two regimes: a constant-gap regime, where $\beta^{\star}$ is simpler than every spurious alternative by a fixed gap $\Delta$, and a vanishing-gap regime, where the fixed gap is replaced by a smoothness condition under which models close to $\beta^{\star}$ in simplicity yield similar predictions. The authors derive sharp non-asymptotic excess-risk bounds: a fast $\tilde{O}(1/n)$ rate in the constant-gap setting and a tunable rate $\tilde{O}(n^{-1+2/(3\tau)})$ in the vanishing-gap regime, where $\tau$ governs the softness of the gap. Supported by this theory and by illustrative experiments on diffusion models and a simplified MLP identity task, the work argues that the simplest model among source minimizers generalizes best to the target distribution, which offers a principled explanation for robust OOD behavior and guidance for choosing regularization in practice.
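The objects named in this summary can be written down as follows. This is a hedged sketch, not the paper's exact definitions: the conditional model $p_\beta(y \mid x)$, the sample size $n$, the regularization weight $\lambda$, and the set $\mathcal{B}_{\mathrm{src}}$ of source-risk minimizers are notation assumed here for illustration.

```latex
% Simplest source minimizer (the ground truth in this framework)
\beta^{\star} \in \arg\min_{\beta \in \mathcal{B}_{\mathrm{src}}} R(\beta)

% Regularized maximum likelihood estimator on n source samples (x_i, y_i)
\hat{\beta}_{\lambda} \in \arg\min_{\beta}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log p_{\beta}(y_i \mid x_i) + \lambda R(\beta) \right\}

% Constant-gap condition: every spurious source minimizer is less simple by at least \Delta
R(\beta) \ge R(\beta^{\star}) + \Delta
  \quad \text{for every spurious } \beta \in \mathcal{B}_{\mathrm{src}}
```

Under this reading, the penalty $\lambda R(\beta)$ is what breaks ties among models that fit the source data equally well, steering the estimator toward $\beta^{\star}$.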
Abstract
Modern foundation models exhibit remarkable out-of-distribution (OOD) generalization, solving tasks far beyond the support of their training data. However, the theoretical principles underpinning this phenomenon remain elusive. This paper investigates this problem by examining the compositional generalization abilities of diffusion models in image generation. Our analysis reveals that while neural network architectures are expressive enough to represent a wide range of models -- including many with undesirable behavior on OOD inputs -- the true, generalizable model that aligns with human expectations typically corresponds to the simplest among those consistent with the training data. Motivated by this observation, we develop a theoretical framework for OOD generalization via simplicity, quantified using a predefined simplicity metric. We analyze two key regimes: (1) the constant-gap setting, where the true model is strictly simpler than all spurious alternatives by a fixed gap, and (2) the vanishing-gap setting, where the fixed gap is replaced by a smoothness condition ensuring that models close in simplicity to the true model yield similar predictions. For both regimes, we study the regularized maximum likelihood estimator and establish the first sharp sample complexity guarantees for learning the true, generalizable, simple model.
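The idea that the simplest model consistent with the training data is the one that generalizes can be illustrated on a toy linear-Gaussian problem. The sketch below is not the paper's experiment: the data-generating process, the choice $R(\beta) = \tfrac{1}{2}\lVert\beta\rVert^2$, and the names `regularized_mle`, `risk`, and `beta_spurious` are all assumptions made for illustration. The source covariates never vary in their second coordinate, so the source data cannot distinguish the true model from a spurious one, while the target covariates can.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
beta_star = np.array([2.0, 0.0])  # assumed ground truth: simplest source minimizer

# Source distribution: the second coordinate is always zero, so the second
# coefficient has no effect on the source likelihood.
X_src = np.column_stack([rng.normal(size=n), np.zeros(n)])
y_src = X_src @ beta_star + 0.1 * rng.normal(size=n)

# Target distribution (covariate shift): both coordinates vary.
X_tgt = rng.normal(size=(n, 2))
y_tgt = X_tgt @ beta_star


def regularized_mle(X, y, lam):
    """Gaussian MLE with penalty lam * 0.5 * ||beta||^2 (ridge regression)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)


def risk(beta, X, y):
    """Mean squared prediction error of beta on (X, y)."""
    return float(np.mean((X @ beta - y) ** 2))


beta_hat = regularized_mle(X_src, y_src, lam=1e-3)

# A spurious model: identical predictions on the source, but less simple.
beta_spurious = np.array([beta_hat[0], 5.0])

print("source risk (simple, spurious):",
      risk(beta_hat, X_src, y_src), risk(beta_spurious, X_src, y_src))
print("target risk (simple, spurious):",
      risk(beta_hat, X_tgt, y_tgt), risk(beta_spurious, X_tgt, y_tgt))
```

Both models attain the same source risk, yet only the regularized estimate keeps a small risk on the target. The penalty sets the unidentified coefficient to zero, which in this toy setup coincides with the true model.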
