A Mathematical Model for Curriculum Learning for Parities
Elisabetta Cornacchia, Elchanan Mossel
TL;DR
This work develops a mathematical model of curriculum learning (CL) for learning $k$-parities over $d$ bits with SGD. It shows that presenting samples from biased product distributions in a two-stage curriculum can reduce the computational cost to polynomial in $d$ under hinge or covariance losses. The main theoretical contribution is a 2-CL strategy that lets a two-layer network learn $k$-parities in $\mathrm{poly}(d)$ time through a constructive two-phase training: first recover the parity support under a biased distribution, then train on the unbiased distribution $\mathrm{Rad}(1/2)^{\otimes d}$ to achieve $\epsilon$-generalization. The paper also defines Hamming mixtures as a negative benchmark for CL with a bounded number of steps, proves that $r$-CL with bounded $r$ cannot overcome the hardness of learning in that setting, and conjectures that a continuous CL (C-CL) approach could close this gap. Empirical results validate the curriculum gains for parity learning on realistic architectures and illustrate the sensitivity to the initial bias. The work also discusses practical limitations and directions for extending CL to broader function classes and sampling schemes.
Abstract
Curriculum learning (CL), i.e. training using samples that are generated and presented in a meaningful order, was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of $k$-parities on $d$ bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples, involving two or more product distributions, significantly reduces the computational cost of learning this class of functions compared to learning under the uniform distribution. Furthermore, we show that for another class of functions, namely the 'Hamming mixtures', CL strategies involving a bounded number of product distributions are not beneficial.
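To make the two-stage sampling scheme concrete, here is a minimal sketch of the data stream only; it includes no network and no SGD. It assumes the convention that $\mathrm{Rad}(p)$ puts probability $p$ on $+1$ and $1-p$ on $-1$, and that the parity is $\chi_S(x) = \prod_{i \in S} x_i$ with $|S| = k$. The function names, the bias value and the sample counts are hypothetical placeholders, not taken from the paper.

```python
import random
from math import prod


def sample_rad(p, d, rng):
    """Draw x in {-1,+1}^d with i.i.d. coordinates, P(x_i = +1) = p (assumed convention)."""
    return [1 if rng.random() < p else -1 for _ in range(d)]


def parity(x, support):
    """k-parity chi_S(x): product of the coordinates of x indexed by `support`."""
    return prod(x[i] for i in support)


def two_stage_curriculum(d, support, p_biased, n_biased, n_uniform, seed=0):
    """Yield (p, x, y): first n_biased samples from Rad(p_biased)^d, then n_uniform from Rad(1/2)^d."""
    rng = random.Random(seed)
    for p, n in [(p_biased, n_biased), (0.5, n_uniform)]:
        for _ in range(n):
            x = sample_rad(p, d, rng)
            yield p, x, parity(x, support)


def coordinate_correlation(mu, k, in_support):
    """Exact E[chi_S(x) * x_i] under a product distribution with E[x_j] = mu = 2p - 1.

    For i in S the factor x_i^2 = 1 drops out, leaving mu^(k-1);
    for i outside S there is one extra independent factor, giving mu^(k+1).
    """
    return mu ** (k - 1) if in_support else mu ** (k + 1)


if __name__ == "__main__":
    for p, x, y in two_stage_curriculum(d=6, support=[0, 2, 4], p_biased=0.9,
                                        n_biased=3, n_uniform=3):
        print(p, x, y)
```

As intuition rather than a restatement of the paper's proof: under the uniform distribution $\mu = 0$, so for $k \ge 2$ every single coordinate is uncorrelated with the label. Under a biased distribution the correlation is $\mu^{k-1}$ for coordinates in the support and $\mu^{k+1}$ for the others, a gap that a first training phase can in principle exploit to identify the support.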
