Generalization on the Unseen, Logic Reasoning and Degree Curriculum

Emmanuel Abbe; Samy Bengio; Aryo Lotfi; Kevin Rizk

TL;DR

This paper introduces Degree-Curriculum, a curriculum learning algorithm that learns monomials more efficiently by incrementing supports, and provides an explanation for the length generalization problem for Boolean functions.

Abstract

This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements. These findings lead to two implications: (1) we provide an explanation to the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can be potentially corrected when undesirable.
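To make the notion of a min-degree interpolator concrete, the following is a minimal sketch on a toy instance: the target $x_0x_1$ with the single unseen point $(x_0, x_1) = (-1,-1)$. The helper names `fourier_coefficients` and `degree_profile` are illustrative assumptions, not code from the paper. The sketch computes Fourier-Walsh coefficients by brute force over $\{-1,+1\}^d$ and checks that the degree-1 function $x_0 + x_1 - 1$ matches the target on every seen point while carrying no Fourier mass on degree 2.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(f, d):
    """Brute-force Fourier-Walsh coefficients of f on {-1, +1}^d.

    The coefficient of a subset T is the average of f(x) * prod_{i in T} x_i
    over all 2^d inputs (uniform measure on the hypercube).
    """
    points = list(product([-1, 1], repeat=d))
    coeffs = {}
    for k in range(d + 1):
        for T in combinations(range(d), k):
            total = sum(f(x) * prod(x[i] for i in T) for x in points)
            coeffs[T] = total / len(points)
    return coeffs


def degree_profile(f, d):
    """Entry k holds the sum of squared coefficients over subsets of size k."""
    profile = [0.0] * (d + 1)
    for T, c in fourier_coefficients(f, d).items():
        profile[len(T)] += c ** 2
    return profile


# Toy instance (illustrative): target x0*x1, unseen point (-1, -1).
target = lambda x: x[0] * x[1]
candidate = lambda x: x[0] + x[1] - 1  # degree-1 function, hypothetical name

seen = [x for x in product([-1, 1], repeat=2) if x != (-1, -1)]
agrees_on_seen = all(target(x) == candidate(x) for x in seen)

print("agrees on seen domain:", agrees_on_seen)
print("target profile:   ", degree_profile(target, 2))
print("candidate profile:", degree_profile(candidate, 2))
print("value at unseen point:", target((-1, -1)), "vs", candidate((-1, -1)))
```

On this instance every interpolator of the seen data differs from the candidate by a multiple of $(1 - x_0)(1 - x_1)$, so the candidate is the one with zero degree-2 mass. The two functions disagree at the unseen point, which illustrates how a min-degree bias can hurt generalization on the unseen.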

Paper Structure (36 sections, 10 theorems, 109 equations, 11 figures, 1 table, 1 algorithm)

Introduction
Our Main Contributions
Generalization on the Unseen
Results
Preliminaries
Fourier-Walsh Transform
Unseen Domain and Vanishing Ideals
Main Theoretical Results
Result Preview from an Example
Results for Random Features Model
Results for Linear Neural Networks
Experiments
Length Generalization
Curriculum Learning
Min-Degree Bias Beyond the Previous Settings
...and 21 more sections

Key Result

Lemma 10

Any continuous polynomially-bounded function $\sigma$ such that its first $P$ coefficients in the Hermite expansion are non-zero is strongly expressive up to $P$.

Figures (11)

Figure 1: Target functions $f_1$ (left), $f_2$ (middle), and $f_3$ (right) learned by the encoder-only Transformer (top row) and the RF model (bottom row). Note that in all of the cases, the Transformer and the RF model learn a solution very close to the min-degree interpolator. More precisely, the coefficients of $x_0x_1, x_1x_2, x_2x_0$ in the left plot ($f_1$), the coefficient of $x_0x_1$ in the middle plot ($f_2$), and the coefficient of $x_0x_1x_2$ in the right plot ($f_3$) are close to zero.
Figure 2: Learning full parity function in dimension $d=15$ in the length generalization setting with inputs in $B_6, B_7, B_8, B_{9}, B_{10}$ and $B_{15}$ (full space) respectively, with an MLP (model details in the appendix). X-axis: degree-profile component, Y-axis: degree-profile value, i.e., $\sum_{T:|T|=x}\hat{f}_{\mathrm{NN}}(T)^2$. As the length of training samples is decreased, the coefficient of the full parity gets smaller and the coefficients of low-degree monomials get larger.
Figure 3: Generalization loss on the 16-parity (left) and 30-parity (right) targets for different numbers of samples with and without the Degree-Curriculum Algorithm. We note that the MLP model trained without curriculum was not able to learn the full parity function in dimension 30 for the given sample sizes (and even up to $10^5$ samples), in contrast to the same model trained with the Degree-Curriculum.
Figure 4: Learning $(\mathrm{parity}_2, \mathcal{U}) = (x_0x_1, \{(x_0, x_1) = (-1,-1)\})$ (left) and $(\mathrm{parity}_4, \mathcal{U}) = (x_0x_1x_2x_3, \{x_0 =-1\})$ (right) embedded in different dimensions with different models. For $\mathrm{parity}_2$ (left) we can see that the min-degree bias is strong for the Transformer even for low-ambient dimensions. We can also see that for the RF model, the min-degree bias becomes stronger as the ambient dimension increases. For $\mathrm{parity}_4$ (right) we can see that the Transformer can almost recover the true function when the ambient and active dimensions match. As the ambient dimension grows slightly, we see that the coefficient of the higher degree term falls rapidly resulting in learning the MD interpolator.
Figure 5: Functions $f_1$ (left), $f_2$ (middle), and $f_3$ (right) of the Experiments section learned by the MLP (top row) and the mean-field model (bottom row). In all of these examples, the higher degree monomials (represented by the solid orange lines in the middle and left columns) are replaceable by the lower degree alternative (represented by the dashed lines). The MLP and mean-field models learn a leaky min-degree interpolator with the coefficient of the higher degree term possibly bounded away from 0.
...and 6 more figures

Theorems & Definitions (37)

Definition 1
Definition 2: Generalization on the Unseen
Definition 3
Definition 4: Degree
Definition 5: Degree profile
Definition 6: Min-degree interpolators
Definition 7
Definition 8: Random features model
Definition 9: Strongly expressive
Lemma 10
...and 27 more
