Aggregating Data for Optimal and Private Learning

Sushant Agarwal, Yukti Makhija, Rishi Saket, Aravindan Raghuveer

TL;DR

This work analyzes how to partition data into bags in MIR and LLP settings to maximize downstream linear regression utility. It shows that, across instance-level MIR loss, LLP bag-level loss, and aggregate-level MIR, the optimal bagging strategies closely correspond to $k$-means clustering of either the labels or the feature vectors, and it extends these insights to Generalized Linear Models. The authors provide formal utility bounds, establish label-DP privacy guarantees with quantified extra error, and validate the theory with experiments. The results offer a principled approach to bag construction in aggregate-label learning, with practical implications for privacy-preserving model training on MIR/LLP data and GLMs.

Abstract

Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are learning frameworks arising in many applications, where the training data is partitioned into disjoint sets, or bags, and only an aggregate label, i.e., a bag-label, is available to the learner for each bag. In MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag's labels. In this paper, we study, for various loss functions in MIR and LLP, the optimal way to partition the dataset into bags so that the utility for downstream tasks like linear regression is maximized. We provide theoretical utility guarantees and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors or the labels with respect to natural objectives such as $k$-means. We also show that our bagging mechanisms can be made label-differentially private, incurring an additional utility error. We then generalize our results to the setting of Generalized Linear Models (GLMs). Finally, we experimentally validate our theoretical results.
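To make the two bag-label definitions concrete, the following is a minimal sketch, not the paper's algorithm. It assumes bags are formed by a plain 1-D $k$-means (Lloyd iterations with a quantile initialization, both chosen here for illustration) on the labels alone, and then computes an LLP bag-label (the bag mean) and an MIR bag-label (the label of one randomly chosen instance). All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns one cluster index per value.

    Initialization at evenly spaced quantiles is an illustrative choice.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    assign = [0] * n
    for _ in range(iters):
        # Assignment step: nearest center in squared distance.
        assign = [min(range(k), key=lambda j: (v - centers[j]) ** 2)
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assign


def make_bags(assign, k):
    """Group instance indices by cluster; empty clusters are dropped."""
    bags = [[] for _ in range(k)]
    for i, a in enumerate(assign):
        bags[a].append(i)
    return [b for b in bags if b]


def llp_bag_labels(bags, labels):
    """LLP: the bag-label is the mean of the bag's labels."""
    return [mean(labels[i] for i in b) for b in bags]


def mir_bag_labels(bags, labels, seed=0):
    """MIR: the bag-label is the label of one undisclosed instance."""
    rng = random.Random(seed)
    return [labels[rng.choice(b)] for b in bags]


# Toy labels (made up for this sketch).
labels = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9, 10.0, 10.1]
bags = make_bags(kmeans_1d(labels, k=3), k=3)
llp = llp_bag_labels(bags, labels)
mir = mir_bag_labels(bags, labels)
print(bags, llp, mir)
```

Clustering the labels is only one of the cases the abstract mentions; for other losses the relevant clustering is over the feature vectors, which this sketch does not cover.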

Paper Structure

This paper contains 44 sections, 27 theorems, 97 equations, 2 figures, 5 tables.

Key Result

Theorem 1

For $\hat{\theta}$ as in eq:instanceloss, for a given bagging $B$, where constants $C_1, C_2$ are independent of $B$.

Figures (2)

  • Figure 1: Random bagging algorithm for bag-LLP
  • Figure 2: Random bagging algorithm for Agg-MIR

Theorems & Definitions (51)

  • Definition 1: Instance-level loss
  • Theorem 1
  • Definition 2: Bag-level loss
  • Theorem 2
  • Definition 3: Aggregate-level loss
  • Theorem 3
  • Definition 4: Label DP
  • Theorem 4
  • Theorem 5
  • proof
  • ...and 41 more