Optimistic Rates for Learning from Label Proportions

Gene Li; Lin Chen; Adel Javanmard; Vahab Mirrokni

TL;DR

This work addresses learning from label proportions (LLP), where each training bag reveals only the average label of its examples rather than the individual labels. It analyzes the learning guarantees of several LLP rules, showing that Empirical Proportional Risk Minimization (EPRM) attains fast rates under realizability but can fail in the agnostic setting. It also shows that two rules, a debiased square loss (DSQ) and the previously proposed EasyLLP, achieve optimistic rates, with near-optimal sample complexity in both settings. The results include upper and lower bounds that cover the realizable and agnostic regimes as well as the effect of bag size $k$ and of marginal proportion estimation. Experiments on MNIST and CIFAR10 illustrate the optimization advantages of debiasing and the empirical performance of the proposed rules. Overall, the paper provides methods that adapt to realizability and clarifies the limits of proportion-matching approaches. These insights are relevant for weakly supervised learning scenarios where instance-level labels are unavailable, with implications for privacy-preserving data collection and scalable weak supervision.
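To make the LLP setup concrete, the following minimal sketch groups labeled examples into bags of size $k$, keeps only each bag's label proportion, and evaluates a proportion-matching square loss. The helper names `make_bags` and `proportion_matching_sq_loss` are hypothetical, and the square-loss form is an illustrative assumption; the exact EPRM objective analyzed in the paper may differ.

```python
import random


def make_bags(examples, k, seed=0):
    """Shuffle (x, y) pairs and group them into bags of size k.

    Returns a list of (features, alpha) pairs, where alpha is the mean
    label of the bag. Individual labels are discarded, and leftover
    examples that do not fill a bag are dropped (an assumption).
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    bags = []
    for start in range(0, len(shuffled) - k + 1, k):
        chunk = shuffled[start:start + k]
        xs = [x for x, _ in chunk]
        alpha = sum(y for _, y in chunk) / k
        bags.append((xs, alpha))
    return bags


def proportion_matching_sq_loss(f, bags):
    """Average over bags of (mean prediction - label proportion)^2."""
    total = 0.0
    for xs, alpha in bags:
        mean_pred = sum(f(x) for x in xs) / len(xs)
        total += (mean_pred - alpha) ** 2
    return total / len(bags)


if __name__ == "__main__":
    data = [(i, i % 2) for i in range(20)]  # toy data: label = parity of x
    bags = make_bags(data, k=5)
    print(proportion_matching_sq_loss(lambda x: x % 2, bags))
```

A predictor that matches every instance label also matches every bag proportion, so its proportion-matching loss is zero. The converse does not hold in general, which is one intuition for why proportion matching can be unreliable outside the realizable case.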

Abstract

We consider a weakly supervised learning problem called Learning from Label Proportions (LLP), where examples are grouped into "bags" and only the average label within each bag is revealed to the learner. We study various learning rules for LLP that achieve PAC learning guarantees for classification loss. We establish that the classical Empirical Proportional Risk Minimization (EPRM) learning rule (Yu et al., 2014) achieves fast rates under realizability, but EPRM and similar proportion matching learning rules can fail in the agnostic setting. We also show that (1) a debiased proportional square loss, as well as (2) a recently proposed EasyLLP learning rule (Busa-Fekete et al., 2023) both achieve "optimistic rates" (Panchenko, 2002); in both the realizable and agnostic settings, their sample complexity is optimal (up to log factors) in terms of $\varepsilon$, $\delta$, and VC dimension.
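The idea of debiasing can be illustrated with a small exact calculation. The sketch below assumes a surrogate label of the form $k(\alpha - p) + p$, where $\alpha$ is the bag's label proportion and $p$ is the marginal probability of a positive label. This form is our understanding of the kind of construction EasyLLP uses and is not stated in this summary, so treat it as an assumption. Under i.i.d. Bernoulli($p$) labels, enumerating the other $k-1$ labels shows that the surrogate has expectation equal to the instance's own hidden label.

```python
from itertools import product


def debiased_soft_label(alpha, k, p):
    """Surrogate label k * (alpha - p) + p (assumed form, see lead-in)."""
    return k * (alpha - p) + p


def expected_soft_label(y, k, p):
    """Exact expectation of the surrogate given one instance's label y.

    The other k - 1 labels in the bag are assumed i.i.d. Bernoulli(p).
    """
    total = 0.0
    for others in product([0, 1], repeat=k - 1):
        ones = sum(others)
        prob = (p ** ones) * ((1 - p) ** (k - 1 - ones))
        alpha = (y + ones) / k  # bag proportion including this instance
        total += prob * debiased_soft_label(alpha, k, p)
    return total


if __name__ == "__main__":
    for y in (0, 1):
        print(y, expected_soft_label(y, k=4, p=0.3))
```

The individual surrogate values can fall outside $[0, 1]$ and their spread grows with $k$, which is consistent with bag size appearing in the sample complexity bounds.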

Paper Structure (55 sections, 20 theorems, 112 equations, 3 figures, 2 tables)

Introduction
Problem Formulation.
Notation.
Our Contributions
Success and Failure of Empirical Proportional Risk Minimization
Optimistic Rates for Debiased Square Loss
Optimistic Rates for EasyLLP
Lower Bounds
Experiments
Related Work
Learning from Label Proportions.
Optimistic Rates.
Empirical Proportional Risk Minimization
Fast Rates for EPRM Under Realizability
Step 1.
...and 40 more sections

Key Result

Theorem 1

Let $\mathcal{F}$ be a symmetric function class, i.e. if $f \in \mathcal{F}$, then $1-f \in \mathcal{F}$. Let bag size $k \ge 11$, $\varepsilon \in (0, 1/(4k))$, and $\delta \in (0,1)$. As long as $n = O ( \tfrac{d\log k \cdot \log (1/\varepsilon) + \log(1/\delta)}{\varepsilon} )$, for any realiz
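The statement above is cut off, so the following sketch only evaluates the displayed expression for the number of bags $n$ to show how it scales with $d$, $k$, $\varepsilon$, and $\delta$. The leading constant `c` and the use of natural logarithms are assumptions, since the bound is stated only up to constants, and the function name `theorem1_sample_size` is hypothetical.

```python
import math


def theorem1_sample_size(d, k, eps, delta, c=1.0):
    """Evaluate c * (d * log k * log(1/eps) + log(1/delta)) / eps.

    The parameter ranges follow the theorem statement: k >= 11,
    eps in (0, 1/(4k)), delta in (0, 1). The constant c is a placeholder.
    """
    if k < 11:
        raise ValueError("the theorem assumes bag size k >= 11")
    if not 0 < eps < 1 / (4 * k):
        raise ValueError("the theorem assumes eps in (0, 1/(4k))")
    if not 0 < delta < 1:
        raise ValueError("the theorem assumes delta in (0, 1)")
    numerator = d * math.log(k) * math.log(1 / eps) + math.log(1 / delta)
    return c * numerator / eps


if __name__ == "__main__":
    print(theorem1_sample_size(d=10, k=16, eps=0.01, delta=0.05))
```

The $1/\varepsilon$ dependence, rather than $1/\varepsilon^2$, is what makes this a fast rate; the bag size enters only through the $\log k$ factor.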

Figures (3)

Figure 1: Training curves of various algorithms for LLP, using the large CNN architecture and bag size $k=100$. One standard deviation confidence bands are plotted.
Figure 2: Left: Loss estimates throughout training. We run a single trial of EZ.Sq and DebiasedSq on MNIST Odd vs. Even using the large CNN architecture, bag size $k=10$, and optimally chosen learning rate. Using the test set, we plot the averaged true square loss $\tfrac{1}{n_\mathrm{test}} \sum_{i=1}^{n_\mathrm{test}} \tfrac{1}{k} \sum_{j=1}^k (f_\theta(x_{i,j}) - y_{i,j})^2$ vs. the estimated square loss $\tfrac{1}{n_\mathrm{test}} \sum_{i=1}^{n_\mathrm{test}} \widehat{\ell}_\mathrm{est}(B_i, \alpha_i)$, where $\widehat{\ell}_\mathrm{est}$ is either the EZ.Sq or the DebiasedSq loss estimate. Middle and Right: Histograms of the true per-bag square losses $\{\tfrac{1}{k} \sum_{j=1}^k (f_\theta(x_{i,j}) - y_{i,j})^2\}_{i=1}^{n_\mathrm{test}}$ and of the per-bag loss estimates $\{\widehat{\ell}_\mathrm{est}(B_i, \alpha_i)\}_{i=1}^{n_\mathrm{test}}$ for the EZ.Sq and DebiasedSq loss estimates at epoch 10.
Figure 3: Training curves for PM.Sq and DebiasedSq with various $\beta$ for bag size $k=10$ on the small CNN architecture. We use a fixed learning rate of $0.001$ and run full-batch GD for 1000 epochs. Each line is an average over 10 trials with different random seeds.
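The true per-bag square loss and its test-set average from the Figure 2 caption can be computed as in the sketch below. The nested-list data layout and the function name are assumptions for illustration; the estimated losses $\widehat{\ell}_\mathrm{est}$ are not reproduced because their formulas are not given here.

```python
def true_square_losses(preds, labels):
    """Per-bag and averaged true square loss.

    preds[i][j] plays the role of f_theta(x_{i,j}) and labels[i][j] of
    y_{i,j}. Returns (average over bags, list of per-bag losses).
    """
    per_bag = []
    for bag_preds, bag_labels in zip(preds, labels):
        k = len(bag_preds)
        loss = sum((p - y) ** 2 for p, y in zip(bag_preds, bag_labels)) / k
        per_bag.append(loss)
    return sum(per_bag) / len(per_bag), per_bag


if __name__ == "__main__":
    preds = [[0.5, 0.5], [1.0, 0.0]]
    labels = [[1, 0], [1, 0]]
    print(true_square_losses(preds, labels))
```

The per-bag list is what the histograms in the middle and right panels would be built from, while the average corresponds to a single point on the left panel's curve.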

Theorems & Definitions (31)

Theorem 1
Proof of Theorem 1
Proposition 1
Proof of Proposition 1
Theorem 2: Sample Complexity Bound for $\widehat{f}_\mathrm{DSQ}$
Proposition 2
proof : Proof.
Theorem 3: Sample Complexity Bound for $\widehat{f}_\mathrm{EZ}$
Lemma 1
Corollary 1: Sample Complexity Bound for $\widehat{f}_\mathrm{EZ}$ with Sample Splitting
...and 21 more
