On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions

Denys Pushkin, Raphaël Berthier, Emmanuel Abbe

TL;DR

This work investigates out-of-domain generalization in the generalization-on-the-unseen (GOTU) setup for non-Boolean inputs, focusing on random feature (RF) models and Transformers. It proves that, in the small features regime with polynomial activations, RF models converge to minimum-degree interpolators (MDIs) on the unseen domain. In the sparse regime, the behavior depends on the data encoding: roots-of-unity embeddings recover MDIs, whereas general integer or real-valued inputs can yield higher-degree interpolators and leaky biases, especially for Transformers. It also extends the Boolean roots-of-unity intuition to the complex setting and provides extensive experiments across discrete and continuous inputs, highlighting when the minimal-degree bias holds and when it breaks. Overall, the paper clarifies when minimal-degree structure governs learning under GOTU and shows that the Boolean special case is not representative of broader non-Boolean scenarios, motivating a more nuanced theory of implicit regularization under distribution shift.

Abstract

We investigate the out-of-domain generalization of random feature (RF) models and Transformers. We first prove that in the "generalization on the unseen" (GOTU) setting, where training data is fully seen in some part of the domain but testing is performed on another part, RF models in the small feature regime converge to interpolators of minimal degree, as in the Boolean case (Abbe et al., 2023). We then consider the sparse target regime and explain how this regime relates to the small feature regime, but with a different regularization term that can alter the picture in the non-Boolean case. We show two different outcomes for the sparse regime with q-ary data tokens: (1) if the data is embedded with roots of unity, then a min-degree interpolator is learned, as in the Boolean case for RF models; (2) if the data is not embedded as such, e.g., simply as integers, then RF models and Transformers may not learn minimal degree interpolators. This shows that the Boolean setting and its roots-of-unity generalization are special cases where the minimal degree interpolator offers a rare characterization of how learning takes place. For more general integer and real-valued settings, a more nuanced picture remains to be fully characterized.
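
The two encodings of q-ary tokens contrasted in the abstract can be sketched as follows. This is an illustrative sketch only: the function names and the centered integer encoding are assumptions made for the example, not definitions taken from the paper. It shows that for q = 2 the roots-of-unity embedding reduces to the Boolean values +1 and -1, and that every embedded token z satisfies z^q = 1 and |z| = 1, properties an integer encoding does not share.

```python
import cmath


def roots_of_unity_embedding(token: int, q: int) -> complex:
    """Map a q-ary token k in {0, ..., q-1} to the q-th root of unity exp(2*pi*i*k/q)."""
    if not 0 <= token < q:
        raise ValueError("token must lie in {0, ..., q-1}")
    return cmath.exp(2j * cmath.pi * token / q)


def integer_embedding(token: int, q: int) -> int:
    """Hypothetical centered integer encoding (assumption): q = 5 gives {-2, ..., 2}."""
    if not 0 <= token < q:
        raise ValueError("token must lie in {0, ..., q-1}")
    return token - (q - 1) // 2


q = 5
complex_tokens = [roots_of_unity_embedding(k, q) for k in range(q)]
integer_tokens = [integer_embedding(k, q) for k in range(q)]

# Every root-of-unity token has modulus 1 and satisfies z**q == 1 (up to rounding).
moduli = [abs(z) for z in complex_tokens]
powers = [z ** q for z in complex_tokens]

# Integer tokens have varying magnitude, and x**q is not constant across tokens.
integer_powers = [x ** q for x in integer_tokens]

# For q = 2 the roots-of-unity embedding is the usual Boolean +1 / -1 encoding.
boolean_tokens = [roots_of_unity_embedding(k, 2) for k in range(2)]
```

Printing `boolean_tokens` gives values equal to 1 and -1 up to floating-point rounding in the imaginary part.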

Paper Structure

This paper contains 22 sections, 12 theorems, 94 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.2

Consider training the random features model $f_{\textnormal{RF}}(a;x)$ in the small features regime (with parameter $\varepsilon$) on the polynomial target function $f$. Assume that we observe the target function on the training set $\mathcal{U}^c$, and that the activation function $\sigma$ satisfie

Figures (6)

  • Figure 1: Training the random features model on $f(x) = 1$ with GOTU constraint $x_1 = 1$ in the small features regime. Here, $d=2$, $N=256$, $\varepsilon=(0.05)^2$, and $\sigma(x) = (1+x)^2$.
  • Figure 2: Training the random features model on $f(x) = x_2^2 + x_2 + 1$ with GOTU constraint $x_1 = 1$ in the small features regime. Here, $d=2$, $N=16384$, $\varepsilon = (0.05)^2$, and $\sigma(x) = (1+x)^2$. The model converged to an MDI, but not "the simplest one", since it depends on $x_1$.
  • Figure 3: Training the random features model on $f(x) = 1$ with GOTU constraint $x_1 = 1$ in the sparse regime with activation $\sigma(x) = (1+x)^4$. Here, $d=15$, $N=3\cdot10^5$, and $H_2(x_1)$ denotes the normalized second-degree Hermite polynomial. The MDI is the constant function $1$, but the model learns a quadratic function.
  • Figure 4: Training the random features model on $f(x) = 1$, $x\in\{-2, -1, 0, 1, 2\}^d$, with GOTU constraint $x_1=1$ and activation $\sigma(x) = (1+x)^2$. Here, $d=15$, $N = 1024$. While the MDI is the constant function, the model learns a linear interpolator.
  • Figure 5: Training a Transformer on $f(x) = 1$, $x\in\{-2, -1, 0, 1, 2\}^d$, with GOTU constraint $x_1=1$, using the AdamW optimizer with a learning rate of $10^{-4}$. Here, $d=15$.
  • ...and 1 more figure
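
The GOTU setup in the captions above can be made concrete with a small enumeration. This is a sketch under stated assumptions: the constraint $x_1 = 1$ is read as describing the training (seen) part of the domain, the dimension is reduced to a hypothetical $d = 3$ so the domain can be listed exhaustively, and the function $x \mapsto x_1$ is only one example of a degree-1 interpolator (the captions do not specify which linear function the model learns). It illustrates that several interpolators fit the training data exactly yet disagree on the unseen domain, which is why the choice among them matters.

```python
from itertools import product

VALUES = (-2, -1, 0, 1, 2)
d = 3  # hypothetical small dimension for enumeration; the figures use d = 15

domain = list(product(VALUES, repeat=d))
seen = [x for x in domain if x[0] == 1]    # training set: x_1 = 1
unseen = [x for x in domain if x[0] != 1]  # held-out part of the domain


def target(x):
    """Target function f(x) = 1."""
    return 1


def constant_interpolator(x):
    """Degree-0 interpolator: the minimal-degree interpolator for f(x) = 1."""
    return 1


def linear_interpolator(x):
    """Example degree-1 interpolator (assumption): equals 1 whenever x_1 = 1."""
    return x[0]


def max_error(g, points):
    """Largest absolute deviation of g from the target over the given points."""
    return max(abs(g(x) - target(x)) for x in points)


# Both functions fit the training data exactly...
seen_err_const = max_error(constant_interpolator, seen)
seen_err_linear = max_error(linear_interpolator, seen)

# ...but only the constant one matches the target on the unseen domain.
unseen_err_const = max_error(constant_interpolator, unseen)
unseen_err_linear = max_error(linear_interpolator, unseen)
```

With these definitions the seen set has 25 points and the unseen set has 100; the linear interpolator has zero training error but deviates from the target by up to 3 on the unseen domain.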

Theorems & Definitions (30)

  • Theorem 4.2
  • Remark 4.3
  • Remark 4.4
  • Remark 4.5
  • Example 5.2
  • Example 5.3
  • Remark 5.4
  • Theorem 6.1
  • Lemma 2.1
  • proof
  • ...and 20 more