A Mathematical Model for Curriculum Learning for Parities

Elisabetta Cornacchia, Elchanan Mossel

TL;DR

This work develops a mathematical model of curriculum learning (CL) for learning $k$-parities over $d$ bits with SGD. It shows that presenting samples from biased product distributions in a two-stage curriculum can reduce the computational cost to polynomial in $d$ under hinge or covariance losses. The main theoretical contribution is a 2-CL strategy that lets a two-layer network learn $k$-parities in $\mathrm{poly}(d)$ time through constructive two-phase training: first recover the parity support under a biased distribution, then train on the unbiased distribution to reach generalization error $\epsilon$ under $\mathrm{Rad}(1/2)^{\otimes d}$. The authors also define Hamming mixtures as a negative benchmark, prove that $r$-CL with a bounded number of steps $r$ cannot overcome the hardness of learning in that setting, and conjecture that a continuous CL (C-CL) approach could close this gap. Empirical results validate the curriculum gains for parity learning on realistic architectures and illustrate sensitivity to the initial bias. The work also discusses practical limitations and directions for extending CL to broader function classes and sampling schemes.
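The two-phase data stream described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the function names, the seed handling, and the convention that $\mathrm{Rad}(p)$ assigns probability $p$ to $+1$ are all assumptions made here. The sketch only generates the curriculum and measures per-coordinate correlations with the label; it does not train a network.

```python
import random


def sample_rad(d, p, rng):
    """Draw x in {-1,+1}^d with independent coordinates.

    Assumed convention: P(x_i = +1) = p, P(x_i = -1) = 1 - p.
    """
    return [1 if rng.random() < p else -1 for _ in range(d)]


def parity(x, support):
    """k-parity: product of the coordinates of x indexed by `support`."""
    out = 1
    for i in support:
        out *= x[i]
    return out


def two_step_curriculum(d, support, p1, steps_biased, steps_uniform,
                        batch_size, seed=0):
    """Yield (phase, batch) pairs: Rad(p1) batches first, then Rad(1/2)."""
    rng = random.Random(seed)
    phases = (("biased", p1, steps_biased), ("uniform", 0.5, steps_uniform))
    for phase, p, steps in phases:
        for _ in range(steps):
            xs = [sample_rad(d, p, rng) for _ in range(batch_size)]
            yield phase, [(x, parity(x, support)) for x in xs]


def coordinate_correlations(batch):
    """Empirical mean of x_i * y over one batch, for each coordinate i."""
    d = len(batch[0][0])
    n = len(batch)
    return [sum(x[i] * y for x, y in batch) / n for i in range(d)]
```

Calling `two_step_curriculum(d=20, support=[0, 1, 2, 3], p1=0.9, steps_biased=1, steps_uniform=1, batch_size=4000)` (hypothetical values) and passing each batch to `coordinate_correlations` gives a quick way to compare the two phases: in the biased batch the support coordinates correlate more strongly with the label than the others, while in the uniform batch all correlations are close to zero.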

Abstract

Curriculum learning (CL) - training using samples that are generated and presented in a meaningful order - was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of $k$-parities on $d$ bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples involving two or more product distributions significantly reduces the computational cost of learning this class of functions, compared to learning under the uniform distribution. Furthermore, we show that for another class of functions - namely the 'Hamming mixtures' - CL strategies involving a bounded number of product distributions are not beneficial.

Paper Structure (23 sections, 26 theorems, 77 equations, 3 figures)

Key Result

Theorem 1

There exists a 2-CL strategy such that a 2-layer fully connected network of size $d^{O(1)}$, trained by SGD with batch size $d^{O(1)}$, can learn any $k$-parity (for $k$ even) up to error $\epsilon$ in at most $d^{O(1)}/\epsilon^2$ iterations.
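A short calculation helps motivate why a biased first phase is useful. It is an illustrative sketch rather than the paper's proof, and it assumes the convention that $\mathrm{Rad}(p)$ puts mass $p$ on $+1$ and $1-p$ on $-1$, so that each coordinate has mean $\mu = 2p - 1$. For a parity $\chi_S(x) = \prod_{j \in S} x_j$ with $|S| = k$ and independent coordinates:

```latex
\mathbb{E}_{x \sim \mathrm{Rad}(p)^{\otimes d}}\big[\, x_i \, \chi_S(x) \,\big]
=
\begin{cases}
\mu^{\,k-1}, & i \in S \quad (\text{since } x_i^2 = 1), \\
\mu^{\,k+1}, & i \notin S,
\end{cases}
\qquad \mu = 2p - 1 .
```

For $0 < |\mu| < 1$ the two cases differ in magnitude by $|\mu|^{k-1}(1 - \mu^2)$, so first-order statistics already separate the support coordinates from the rest. Under the uniform distribution, $\mu = 0$ and both cases vanish for $k \ge 2$, so no such signal is available.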

Figures (3)

  • Figure 1: Learning $20$-parities with a $2$-step curriculum, with initial bias $p_1=39/40$ (top-left), $p_1=19/20$ (top-center), $p_1=1/20$ (top-right), with a continuous curriculum (bottom-left), and with no curriculum (bottom-right). In all plots, we use a 2-layer ReLU MLP with batch size 1024, input dimension 100, and 100 hidden units.
  • Figure 2: Convergence time for different values of $d$, $k$. Left: we take $p_1 = 1/16$ and a 2-layer $\mathrm{ReLU}$ architecture with $h=2^k$ hidden units. Right: we take $p_1 = 1-\frac{1}{2k}$ and a 2-layer $\mathrm{ReLU}$ architecture with $h=d$ hidden units.
  • Figure 3: Convergence time with respect to the initial bias $p_1$. We compute the convergence time for learning a $10$-parity over $100$ bits with a 2-layer $\mathrm{ReLU}$ network. We omitted all points with convergence time above $100,000$.

Theorems & Definitions (54)

  • Definition 1: r-steps curriculum learning (r-CL)
  • Definition 2: Generalization error
  • Theorem 1: Main positive result, informal
  • Definition 3: $(S,T,\epsilon)$-Hamming mixture
  • Theorem 2: Main negative result, informal
  • Definition 4: Continuous curriculum learning (C-CL)
  • Theorem 3: Hinge Loss
  • Definition 5: Covariance loss
  • Remark 1
  • Theorem 4: Covariance Loss
  • ...and 44 more