A Mathematical Model for Curriculum Learning for Parities

Elisabetta Cornacchia; Elchanan Mossel

TL;DR

This work develops a mathematical model of curriculum learning (CL) for learning $k$-parities over $d$ bits using SGD. It shows that presenting samples from biased product distributions in a two-stage curriculum can reduce the computational cost to polynomial in $d$ under hinge or covariance losses. The main theoretical contribution is a 2-CL strategy that enables a two-layer network to learn $k$-parities in $\mathrm{poly}(d)$ time through a constructive two-phase training: first recover the parity support under a biased distribution, then train on the unbiased distribution to achieve $\epsilon$-generalization under $\mathrm{Rad}(1/2)^{\otimes d}$. The authors also define Hamming mixtures as a negative benchmark for curricula with a bounded number of steps, proving that bounded $r$-CL cannot overcome learning hardness in that setting, and conjecture that a continuous CL (C-CL) approach could close this gap. Empirical results validate curriculum gains for parity learning on realistic architectures and illustrate sensitivity to the initial bias. The work also discusses practical limitations and directions for extending CL to broader function classes and sampling schemes.

Abstract

Curriculum learning (CL), training using samples that are generated and presented in a meaningful order, was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of $k$-parities on $d$ bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples, involving two or more product distributions, can significantly reduce the computational cost of learning this class of functions compared to learning under the uniform distribution. Furthermore, we show that for another class of functions, namely the "Hamming mixtures", CL strategies involving a bounded number of product distributions are not beneficial.
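To make the sampling scheme concrete, here is a minimal sketch of a two-step curriculum over product distributions. It assumes the convention that $\mathrm{Rad}(p)$ puts probability $p$ on $+1$ and $1-p$ on $-1$, and that a $k$-parity is the product of the coordinates in its support. The function names (`sample_rad`, `parity`, `two_step_curriculum`) and the step counts are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def sample_rad(p, n, d, rng):
    """Draw n points from Rad(p)^{(x)d}: each coordinate is +1 w.p. p, else -1 (assumed convention)."""
    return np.where(rng.random((n, d)) < p, 1, -1)

def parity(x, support):
    """Label each row by the parity chi_S(x) = prod_{i in S} x_i."""
    return np.prod(x[:, support], axis=1)

def two_step_curriculum(p1, steps_biased, steps_uniform, batch_size, d, support, seed=0):
    """Yield (x, y) batches: first from the biased Rad(p1)^{(x)d}, then from the uniform Rad(1/2)^{(x)d}."""
    rng = np.random.default_rng(seed)
    for t in range(steps_biased + steps_uniform):
        p = p1 if t < steps_biased else 0.5  # switch distributions after the first stage
        x = sample_rad(p, batch_size, d, rng)
        yield x, parity(x, support)
```

Each yielded batch could be fed to one SGD step of whatever model is being trained; only the input distribution changes between the two stages, while the target parity stays fixed.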

Paper Structure (23 sections, 26 theorems, 77 equations, 3 figures)

Introduction
Contributions.
Related Work
Definitions and Main Results
Learning Parities
Theoretical Results
Empirical Results
Learning Hamming Mixtures
Conclusion and Future Work
Proof of the Main Positive Result
Proof Setup
First Step: Recovering the Support
Population gradient at initialization.
Effective gradient at initialization.
Second Step: Convergence
...and 8 more sections

Key Result

Theorem 1

There exists a 2-CL strategy such that a 2-layer fully connected network of size $d^{O(1)}$, trained by SGD with batch size $d^{O(1)}$, can learn any $k$-parity (for $k$ even) up to error $\epsilon$ in at most $d^{O(1)}/\epsilon^2$ iterations.
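A standard correlation computation gives some intuition for why a biased first stage can expose the support of the parity. This is a sketch under stated assumptions, not the paper's proof: it assumes $\mathrm{Rad}(p)$ puts probability $p$ on $+1$, and it introduces the notation $\mu = 2p - 1$ for the mean of each coordinate and $\chi_S(x) = \prod_{j \in S} x_j$ for the parity on a support $S$ with $|S| = k$. Since the coordinates are independent and $x_i^2 = 1$,

$$
\mathbb{E}_{x \sim \mathrm{Rad}(p)^{\otimes d}}\big[\chi_S(x)\, x_i\big] =
\begin{cases}
\mu^{k-1}, & i \in S,\\
\mu^{k+1}, & i \notin S,
\end{cases}
\qquad\text{so the gap is}\qquad \mu^{k-1}\,(1 - \mu^2).
$$

Under the uniform distribution, $\mu = 0$ and both correlations vanish for $k \ge 2$, so no single coordinate carries a first-order signal about $S$. With a bias such as $p = 1 - \frac{1}{2k}$, we get $\mu = 1 - \frac{1}{k}$, hence $\mu^{k-1} \ge 1/e$ and $1 - \mu^2 \ge 1/k$, and the gap is at least $\frac{1}{ek}$, which is only polynomially small in $k$.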

Figures (3)

Figure 1: Learning $20$-parities with a $2$-step curriculum, with initial bias $p_1=39/40$ (top-left), $p_1=19/20$ (top-center), $p_1=1/20$ (top-right), with continuous curriculum (bottom-left) and with no curriculum (bottom-right). In all plots, we use a 2-layer ReLU MLP with batch size 1024, input dimension 100, and 100 hidden units.
Figure 2: Convergence time for different values of $d$, $k$. Left: we take $p_1 = 1/16$ and a 2-layer ReLU architecture with $h=2^k$ hidden units. Right: we take $p_1 = 1-\frac{1}{2k}$ and a 2-layer ReLU architecture with $h=d$ hidden units.
Figure 3: Convergence time with respect to the initial bias $p_1$. We compute the convergence time for learning a $10$-parity over $100$ bits with a 2-layer ReLU network. We omitted all points with convergence time above $100{,}000$.

Theorems & Definitions (54)

Definition 1: r-steps curriculum learning (r-CL)
Definition 2: Generalization error
Theorem 1: Main positive result, informal
Definition 3: $(S,T,\epsilon)$-Hamming mixture
Theorem 2: Main negative result, informal
Definition 4: Continuous curriculum learning (C-CL)
Theorem 3: Hinge Loss
Definition 5: Covariance loss
Remark 1
Theorem 4: Covariance Loss
...and 44 more
