Optimistic Rates for Learning from Label Proportions
Gene Li, Lin Chen, Adel Javanmard, Vahab Mirrokni
TL;DR
This work addresses learning from label proportions (LLP), where each training bag reveals only its average label rather than individual labels. It analyzes the learning guarantees of several LLP rules, showing that Empirical Proportional Risk Minimization (EPRM) attains fast rates under realizability but can fail in the agnostic setting. It then shows that two rules, a debiased square loss (DSQ) and the previously proposed EasyLLP, achieve optimistic rates, with near-optimal sample complexity in both settings. The results include upper and lower bounds for the realizable and agnostic regimes, as well as bounds on the impact of bag size $k$ and of marginal proportion estimation. Experiments on MNIST and CIFAR10 illustrate the optimization advantages of debiasing and the empirical performance of these rules. Overall, the paper identifies learning rules that adapt to realizability and clarifies the limits of proportion-matching approaches. These results are relevant to weakly supervised learning scenarios where instance-level labels are unavailable, including privacy-preserving data collection.
Abstract
We consider a weakly supervised learning problem called Learning from Label Proportions (LLP), where examples are grouped into ``bags'' and only the average label within each bag is revealed to the learner. We study various learning rules for LLP that achieve PAC learning guarantees for classification loss. We establish that the classical Empirical Proportional Risk Minimization (EPRM) learning rule (Yu et al., 2014) achieves fast rates under realizability, but EPRM and similar proportion matching learning rules can fail in the agnostic setting. We also show that (1) a debiased proportional square loss, as well as (2) a recently proposed EasyLLP learning rule (Busa-Fekete et al., 2023) both achieve ``optimistic rates'' (Panchenko, 2002); in both the realizable and agnostic settings, their sample complexity is optimal (up to log factors) in terms of $\epsilon, \delta$, and VC dimension.
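To make the LLP setup concrete, the following Python sketch groups binary labels into bags of size $k$, keeps only each bag's average label, and scores a predictor by how well its predicted proportions match the observed ones. It is an illustration under stated assumptions, not the paper's implementation: the names `make_bags` and `proportion_matching_loss`, the threshold labelling, and the plain (non-debiased) squared gap are all hypothetical choices made for this example.

```python
import random


def make_bags(labels, k):
    """Split labels into consecutive bags of size k; return each bag's average."""
    if len(labels) % k != 0:
        raise ValueError("number of examples must be divisible by the bag size k")
    return [sum(labels[i:i + k]) / k for i in range(0, len(labels), k)]


def proportion_matching_loss(predictions, bag_proportions, k):
    """Mean squared gap between predicted and observed bag proportions.

    Assumption: a plain squared gap, used only to illustrate proportion
    matching; it is not the debiased loss analyzed in the paper.
    """
    predicted = make_bags(predictions, k)
    gaps = [(p - a) ** 2 for p, a in zip(predicted, bag_proportions)]
    return sum(gaps) / len(gaps)


rng = random.Random(0)
k = 4
xs = [rng.uniform(-1, 1) for _ in range(20)]
ys = [1 if x > 0 else 0 for x in xs]  # hypothetical ground-truth labels
alphas = make_bags(ys, k)             # the only label information the learner sees

perfect = [1 if x > 0 else 0 for x in xs]  # predictor that agrees with every label
perfect_loss = proportion_matching_loss(perfect, alphas, k)

# A predictor can match a bag's proportion while getting every instance wrong.
swapped_labels = [1, 0]
swapped_predictions = [0, 1]
swapped_loss = proportion_matching_loss(
    swapped_predictions, make_bags(swapped_labels, 2), 2
)
```

Here `perfect_loss` is zero because the predictor reproduces every label, and `swapped_loss` is also zero even though both instance-level predictions in that bag are wrong. The second case shows only that matching bag proportions does not by itself determine instance-level labels; it is not a reconstruction of the paper's agnostic lower-bound argument.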
