Statistical curriculum learning: An elimination algorithm achieving an oracle risk

Omer Cohen, Ron Meir, Nir Weinberger

TL;DR

The paper studies statistical curriculum learning in a parametric mean-estimation setting with a target task and multiple source tasks differing in similarity $Q_t$ and noise variance $\sigma_t^2$. It introduces an adaptive, multi-round source-elimination CL algorithm that prunes sources based on estimated similarity and a quantified elimination curve, achieving weak-oracle-level risk after $O(\log T)$ rounds and, in the single-source case, matching the strong-oracle rate. The work presents two minimax lower bounds under localized problem instances, discusses the challenges of constructing homogeneous instance sets, and identifies regimes where the weak oracle is minimax optimal (notably $T\le 2$); it also extends the framework to unknown variances/covariances and supports empirical validation. Overall, the results provide a principled, theoretically grounded approach to curriculum design in statistical learning with multiple sources, highlighting when adaptive sampling yields optimal or near-optimal risk. The findings have implications for transfer/meta-learning and structured CL in high-dimensional parametric settings.

Abstract

We consider a statistical version of curriculum learning (CL) in a parametric prediction setting. The learner is required to estimate a target parameter vector, and can adaptively collect samples from either the target model, or other source models that are similar to the target model, but less noisy. We consider three types of learners, depending on the level of side-information they receive. The first two, referred to as strong/weak-oracle learners, receive high/low degrees of information about the models, and use these to learn. The third, a fully adaptive learner, estimates the target parameter vector without any prior information. In the single source case, we propose an elimination learning method, whose risk matches that of a strong-oracle learner. In the multiple source case, we advocate that the risk of the weak-oracle learner is a realistic benchmark for the risk of adaptive learners. We develop an adaptive multiple elimination-rounds CL algorithm, and characterize instance-dependent conditions for its risk to match that of the weak-oracle learner. We consider instance-dependent minimax lower bounds, and discuss the challenges associated with defining the class of instances for the bound. We derive two minimax lower bounds, and determine the conditions under which the performance of the weak-oracle learner is minimax optimal.
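The trade-off between a noisy target and a less noisy but biased source can be illustrated with a minimal Python sketch. It is not taken from the paper: it assumes Gaussian models with isotropic noise, reads $Q^2$ as the squared Euclidean distance between the source and target parameter vectors, and uses hypothetical function names. Under those assumptions, the empirical mean of $n$ target samples has risk $d\sigma_0^2/n$, while the empirical mean of $n$ source samples, used as an estimate of the target parameter, has risk $Q^2 + d\sigma^2/n$.

```python
import numpy as np


def target_only_risk(d, sigma0_sq, n):
    """Risk of the empirical mean of n target samples: d * sigma0^2 / n."""
    return d * sigma0_sq / n


def source_only_risk(d, sigma_sq, q_sq, n):
    """Risk of using the empirical mean of n source samples as the target
    estimate: squared bias q_sq plus variance d * sigma^2 / n."""
    return q_sq + d * sigma_sq / n


def simulate_source_only(d, sigma_sq, q_sq, n, reps=2000, seed=0):
    """Monte Carlo check of source_only_risk (illustrative assumption:
    Gaussian noise, source parameter offset along the first coordinate)."""
    rng = np.random.default_rng(seed)
    theta0 = np.zeros(d)
    theta1 = theta0.copy()
    theta1[0] = np.sqrt(q_sq)
    # The mean of n i.i.d. N(theta1, sigma^2 I) samples is N(theta1, (sigma^2 / n) I),
    # so the sample means are drawn directly instead of averaging n samples.
    means = theta1 + rng.normal(scale=np.sqrt(sigma_sq / n), size=(reps, d))
    return float(np.mean(np.sum((means - theta0) ** 2, axis=1)))


# Parameters borrowed from the first experiment (N=1000, d=2, sigma^2=1, sigma_0^2=10);
# the two q_sq values are hypothetical.
d, n, sigma_sq, sigma0_sq = 2, 1000, 1.0, 10.0
print("target only:", target_only_risk(d, sigma0_sq, n))
for q_sq in (0.01, 0.05):
    print("source only, Q^2 =", q_sq, ":", source_only_risk(d, sigma_sq, q_sq, n))
```

In this simplified picture the source is preferable only when $Q^2 < d(\sigma_0^2-\sigma^2)/n$; a learner without side-information has to estimate on which side of that threshold it sits, which is the role of the elimination step.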

Paper Structure (32 sections, 21 theorems, 185 equations, 7 figures, 1 algorithm)

Key Result

Theorem 1

Let $\tilde{\theta}_{0}=\bar{\theta}_{0}(N/2)$ (resp. $\tilde{\theta}_{1}=\bar{\theta}_{1}(N/2)$) be an initial estimate of $\theta_{0}$ using $N/2$ i.i.d. samples from the target model ${\cal M}_{0}$ (resp. source model ${\cal M}_{1}$). Let $\delta\in(0,1)$ be given, and let $\hat{\theta}$ be the final estimate of $\theta_{0}$. Then, there exists $\nu\in[1/27,1)$ such that, with probability at least $1-\delta$ Assum
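The decision step behind a single-source elimination test can be sketched as follows. This is an illustration under stated assumptions, not the paper's rule: it assumes Gaussian noise with known variances, and it replaces the paper's elimination curve with a standard chi-square tail bound (Laurent-Massart) as the threshold. The function names and the choice to return the source estimate when the source is retained are hypothetical.

```python
import numpy as np


def chi2_upper_bound(d, delta):
    """Laurent-Massart bound: a chi-square(d) variable exceeds
    d + 2*sqrt(d*x) + 2*x with probability at most exp(-x); here x = log(1/delta)."""
    x = np.log(1.0 / delta)
    return d + 2.0 * np.sqrt(d * x) + 2.0 * x


def single_source_elimination(target, source, sigma0_sq, sigma_sq, delta):
    """target, source: arrays of shape (n0, d) and (n1, d) holding the initial samples.

    Returns (estimate, eliminated). The source is eliminated when the squared
    distance between the two initial means is larger than what noise alone
    would explain with probability at least 1 - delta.
    """
    n0, d = target.shape
    n1 = source.shape[0]
    theta0_tilde = target.mean(axis=0)  # initial target estimate
    theta1_tilde = source.mean(axis=0)  # initial source estimate
    dist_sq = float(np.sum((theta1_tilde - theta0_tilde) ** 2))
    # If the two parameters coincide, the difference of the means is
    # N(0, (sigma0^2/n0 + sigma^2/n1) I), so dist_sq is a scaled chi-square(d).
    threshold = (sigma0_sq / n0 + sigma_sq / n1) * chi2_upper_bound(d, delta)
    eliminated = bool(dist_sq > threshold)
    estimate = theta0_tilde if eliminated else theta1_tilde
    return estimate, eliminated


# Hypothetical usage with N/2 = 50_000 samples per model, d = 2.
rng = np.random.default_rng(0)
n, d = 50_000, 2
target = rng.normal(scale=1.0, size=(n, d))                      # theta_0 = 0, sigma_0^2 = 1
source = np.array([0.6, 0.0]) + rng.normal(scale=np.sqrt(0.1), size=(n, d))
estimate, eliminated = single_source_elimination(target, source, 1.0, 0.1, delta=0.05)
print("eliminated:", eliminated)
```

The sketch covers only the keep-or-eliminate decision on the initial samples; how any remaining sampling budget is allocated afterwards is not modeled here.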

Figures (7)

  • Figure 1: Left: The elimination curves $\beta_{\delta}(\tau)$ for Example 1 (the identity line in dashed yellow). Right: The loss $\|\hat{\theta}-\theta_{0}\|^{2}$ on a log-scale. Parameters are $N=10^{5},\;d=2,\;\sigma^{2}=0.1,\;\sigma_{0}^{2}=1,\;\tilde{Q}_{\text{close}}^{2}=0,\;\tilde{Q}_{\text{medium}}^{2}=10,\;\tilde{Q}_{\text{far}}^{2}=2\cdot10^{4}$ where $\tilde{Q}_{t}^{2}:=Q_{t}^{2}/(d\sigma_{0}^{2}/N)$ is the normalized distance.
  • Figure 2: The first experiment: Runs of Algorithm 1 over $200$ repetitions. Parameters are $T=2,\;N=1000,\;d=2,\;\sigma^{2}=1,\;\sigma_{0}^{2}=10$.
  • Figure 3: The second experiment: Runs of Algorithm 1 over $200$ repetitions. Parameters are $N=10^{5},\;d=2,\;\sigma^{2}=0.1,\;\sigma_{0}^{2}=1,\;\tilde{Q}_{\text{close}}^{2}=0,\;\tilde{Q}_{\text{medium}}^{2}=10,\;\tilde{Q}_{\text{far}}^{2}=2\cdot10^{4}$.
  • Figure 4: The third experiment: Parameters are $N=10^{5},\;d=2,\;\sigma^{2}=0.1,\;\sigma_{0}^{2}=1,\;\delta=0.05$.
  • Figure 5: The parameter hypotheses for $T=1$. One of the hypotheses is in black and the other in red. The target parameter is designated by a disc and the source parameter by a square.
  • ...and 2 more figures
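
As a worked example of the normalization $\tilde{Q}_{t}^{2}:=Q_{t}^{2}/(d\sigma_{0}^{2}/N)$ used in the captions of Figures 1 and 3, the normalized values can be converted back to unnormalized squared distances with the stated parameters $N=10^{5}$, $d=2$, $\sigma_{0}^{2}=1$. Only the definition and the parameter values come from the captions; the arithmetic is added here for illustration.

```latex
\begin{align*}
\frac{d\sigma_{0}^{2}}{N} &= \frac{2\cdot 1}{10^{5}} = 2\cdot 10^{-5}, \\
Q_{\text{close}}^{2} &= 0\cdot 2\cdot 10^{-5} = 0, \\
Q_{\text{medium}}^{2} &= 10\cdot 2\cdot 10^{-5} = 2\cdot 10^{-4}, \\
Q_{\text{far}}^{2} &= 2\cdot 10^{4}\cdot 2\cdot 10^{-5} = 0.4 .
\end{align*}
```

The scale $d\sigma_{0}^{2}/N$ is the risk of the empirical mean of $N$ target samples, so $\tilde{Q}_{t}^{2}$ measures the squared source bias in units of that target-only risk.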

Theorems & Definitions (47)

  • Theorem 1
  • Corollary 1: to the source elimination lemma, Lemma \ref{lem: iden}
  • Example 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 1
  • proof
  • Theorem 5: koltchinskii2015bounding
  • ...and 37 more