PriorBoost: An Adaptive Algorithm for Learning from Aggregate Responses

Adel Javanmard; Matthew Fahrbach; Vahab Mirrokni

TL;DR

The paper addresses learning from aggregate responses by designing bags that are homogeneous with respect to unobserved individual responses. It shows that, with a prior on expected responses, optimal bagging reduces to a one-dimensional, size-constrained $k$-means clustering problem, and extends this to GLMs with practical simplifications for logistic and Poisson models. The proposed PriorBoost algorithm iteratively refines bags using prior predictions derived from training data, achieving near-optimal event-level prediction quality and outperforming random bagging, even under label differential privacy. Theoretical analysis demonstrates a clear advantage over random bagging in bias-variance terms, and experiments across linear and logistic regression tasks, plus DP settings, validate substantial gains in model utility and robustness. The work offers a practical framework for private, aggregate-based learning with adaptable bag construction and principled privacy considerations.

Abstract

This work studies algorithms for learning from aggregate responses. We focus on the construction of aggregation sets (called bags in the literature) for event-level loss functions. We prove for linear regression and generalized linear models (GLMs) that the optimal bagging problem reduces to one-dimensional size-constrained $k$-means clustering. Further, we theoretically quantify the advantage of using curated bags over random bags. We then propose the PriorBoost algorithm, which adaptively forms bags of samples that are increasingly homogeneous with respect to (unobserved) individual responses to improve model quality. We study label differential privacy for aggregate learning, and we also provide extensive experiments showing that PriorBoost regularly achieves optimal model quality for event-level predictions, in stark contrast to non-adaptive algorithms.
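
As a rough illustration of why curated bags can help, the sketch below sorts samples by a prior prediction and cuts the sorted order into consecutive bags of size $k$, then compares within-bag response spread against random bags. This is a simplification under stated assumptions, not the paper's procedure: equal-size consecutive cuts are only one feasible solution of the size-constrained one-dimensional $k$-means problem, the prior here is taken to be the true linear signal, and the function names (`make_bags_by_prior`, `make_random_bags`, `within_bag_sse`) and the synthetic data are hypothetical.

```python
import numpy as np

def make_bags_by_prior(prior, k):
    """Sort samples by prior prediction and cut into consecutive bags of size k."""
    order = np.argsort(prior, kind="stable")
    return [order[i:i + k] for i in range(0, len(order), k)]

def make_random_bags(n, k, rng):
    """Baseline: assign samples to bags of size k uniformly at random."""
    perm = rng.permutation(n)
    return [perm[i:i + k] for i in range(0, n, k)]

def within_bag_sse(values, bags):
    """Sum of squared deviations of each value from its bag mean."""
    return float(sum(((values[b] - values[b].mean()) ** 2).sum() for b in bags))

# Synthetic linear-regression data (an assumption made for illustration only).
rng = np.random.default_rng(0)
n, k = 1000, 10
x = rng.normal(size=(n, 3))
theta = np.array([1.0, -2.0, 0.5])
y = x @ theta + 0.1 * rng.normal(size=n)

# Hypothetical prior on expected responses; here it equals the true signal.
prior = x @ theta

curated = make_bags_by_prior(prior, k)
random_bags = make_random_bags(n, k, rng)

# The learner only observes one aggregate response (the mean) per bag.
bag_means = np.array([y[b].mean() for b in curated])

sse_curated = within_bag_sse(y, curated)
sse_random = within_bag_sse(y, random_bags)
print(f"within-bag SSE: curated={sse_curated:.2f}, random={sse_random:.2f}")
```

In this toy setting the curated bags are far more homogeneous in the individual responses than the random ones, which is the property the abstract describes PriorBoost as pursuing adaptively when no exact prior is available.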

Paper Structure (33 sections, 13 theorems, 99 equations, 4 figures, 1 algorithm)

Introduction
Problem statement
Overview of our approach and contributions
Other related work
Warm-up: Linear regression
Bounding the estimator error
Reducing to size-constrained $k$-means clustering
Extension to GLMs
Logistic regression.
Poisson regression.
Comparison with random bagging
Algorithm
Differential privacy for aggregate responses
Experiments
Linear regression
...and 18 more sections

Key Result

Theorem 2.1

If the design matrix $\bm{X}\in\mathbb{R}^{n\times d}$ has rank $d$, then for the estimator ${\widehat{\bm{\theta}}}$ given by eq:hth, we have

Figures (4)

Figure 1: Linear regression. Compares PriorBoost (solid) with OneShot (left, dotted) and PBPrefix (right, dashed) by plotting test MSE at each step $t$ for different bag sizes $k$.
Figure 2: Logistic regression. Compares PriorBoost (solid) with OneShot (left, dotted) and PBPrefix (right, dashed) by plotting test log loss at each step $t$ for different bag sizes $k$.
Figure 3: $\varepsilon$-label differentially private logistic regression. Compares final PriorBoost (solid) test log loss to OneShot (dashed) for different bag sizes $k$.
Figure 4: Optimal bag sizes $k$ for $\varepsilon$-label DP PriorBoost for logistic regression. Test loss for $\varepsilon = 1$ as the number of samples $n$ increases.

Theorems & Definitions (21)

Theorem 2.1
Corollary 2.2
Lemma 2.3: Sorting structure
Theorem 3.1
Corollary 3.2
Definition 4.1
Theorem 4.4
Theorem 4.5
Remark 4.6
Remark 4.7
...and 11 more
