On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions
Denys Pushkin, Raphaël Berthier, Emmanuel Abbe
TL;DR
This work investigates out-of-domain generalization under the GOTU setup for non-Boolean inputs, focusing on random feature (RF) models and Transformers. It proves that in the small features regime with polynomial activations, RF models converge to minimum-degree interpolators (MDIs) on the unseen domain, while in the sparse regime the behavior depends on data encoding: roots-of-unity embeddings recover MDIs, whereas general integer or real-valued inputs can yield higher-degree interpolants and leaky biases, especially for Transformers. It also extends the Boolean roots-of-unity intuition to the complex setting and provides extensive experiments across discrete and continuous inputs, highlighting when the minimal-degree bias holds and when it breaks. Overall, the paper clarifies when minimal-degree structure governs learning under GOTU and shows that the Boolean special case is not representative of broader non-Boolean scenarios, motivating a more nuanced theory of implicit regularization under distribution shift.
Abstract
We investigate the out-of-domain generalization of random feature (RF) models and Transformers. We first prove that in the "generalization on the unseen" (GOTU) setting, where training data is fully seen in some part of the domain but testing is done on another part, RF models in the small feature regime converge to interpolators of minimal degree, as in the Boolean case (Abbe et al., 2023). We then consider the sparse target regime and explain how it relates to the small feature regime, but with a different regularization term that can alter the picture in the non-Boolean case. We show two different outcomes for the sparse regime with q-ary data tokens: (1) if the data is embedded with roots of unity, then a min-degree interpolator is learned, as in the Boolean case, for RF models; (2) if the data is not embedded as such, e.g., simply as integers, then RF models and Transformers may not learn minimal degree interpolators. This shows that the Boolean setting and its roots-of-unity generalization are special cases where the minimal degree interpolator offers a rare characterization of how learning takes place. For more general integer and real-valued settings, a more nuanced picture remains to be fully characterized.
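
To make the two encodings and the seen/unseen split concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the paper's experimental code: the function names `roots_of_unity_embedding` and `gotu_split`, and the choice of defining the seen part by freezing one coordinate to a fixed value, are hypothetical choices made here for the example.

```python
import itertools
import numpy as np

def roots_of_unity_embedding(tokens, q):
    """Map q-ary tokens k in {0, ..., q-1} to the complex numbers exp(2*pi*i*k/q)."""
    tokens = np.asarray(tokens)
    return np.exp(2j * np.pi * tokens / q)

def gotu_split(q, d, frozen_coord=0, frozen_value=0):
    """Enumerate all q**d inputs and split them into a seen and an unseen part.

    Assumption for illustration: the seen part consists of the inputs whose
    coordinate `frozen_coord` equals `frozen_value`; everything else is unseen.
    """
    domain = np.array(list(itertools.product(range(q), repeat=d)))
    seen_mask = domain[:, frozen_coord] == frozen_value
    return domain[seen_mask], domain[~seen_mask]

q, d = 3, 2
seen, unseen = gotu_split(q, d)

# Integer encoding: the tokens are fed to the model as-is.
x_seen_int = seen.astype(float)

# Roots-of-unity encoding: every embedded coordinate lies on the unit circle.
z_seen = roots_of_unity_embedding(seen, q)

# For q = 2 the embedding reduces to the Boolean +1/-1 encoding.
boolean_case = roots_of_unity_embedding([0, 1], 2)
```

With q = 2 the embedding sends the tokens 0 and 1 to +1 and -1, which is the sense in which the roots-of-unity encoding generalizes the Boolean setting; the integer encoding has no such unit-modulus property for q > 2.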
