SADA: Safe and Adaptive Aggregation of Multiple Black-Box Predictions in Semi-Supervised Learning

Jiawei Shan, Zhifeng Chen, Yiming Dong, Yazhen Wang, Jiwei Zhao

TL;DR

SADA provides a principled, data-driven framework to safely and adaptively aggregate multiple black-box predictions in semi-supervised settings. By formulating an unbiased augmented estimating equation and selecting an optimal weight vector, SADA guarantees performance not worse than the labeled-only baseline and exploits highly informative predictors to improve efficiency or achieve semiparametric efficiency. The approach is supported by theoretical guarantees, simulations, and two real-data analyses (politeness regression and ImageNet classification), and is accessible via an R package. This enables robust use of diverse predictions from models and systems with uncertain quality in SSL contexts, with broad practical impact for inference and prediction tasks.

Abstract

Semi-supervised learning (SSL) arises in practice when labeled data are scarce or expensive to obtain, while large quantities of unlabeled data are readily available. With the growing adoption of machine learning techniques, it has become increasingly feasible to generate multiple predicted labels using a variety of models and algorithms, including deep learning, large language models, and generative AI. In this paper, we propose a novel approach that safely and adaptively aggregates multiple black-box predictions of uncertain quality for both inference and prediction tasks. Our method provides two key guarantees: (i) it never performs worse than using the labeled data alone, regardless of the quality of the predictions; and (ii) if any one of the predictions (without knowing which one) perfectly fits the ground truth, the algorithm adaptively exploits this to achieve either a faster convergence rate or the semiparametric efficiency bound. We demonstrate the effectiveness of the proposed algorithm through small-scale simulations and two real-data analyses with distinct scientific goals. A user-friendly R package, sada, is provided to facilitate practical implementation.
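To make the safety and adaptivity idea concrete, the following is a minimal sketch for the simplest estimand, the mean of the outcome. It is not the API of the sada R package and not the paper's general estimating-equation procedure; the function name `aggregate_mean`, its arguments, and the plug-in weight formula are assumptions chosen for illustration. The labeled-only mean is corrected by a weighted, mean-zero contrast between predictions on labeled and unlabeled data, with weights chosen to minimize variance. Uninformative predictors receive weights near zero, so the estimate falls back to the labeled-only mean.

```python
import numpy as np

def aggregate_mean(y_lab, preds_lab, preds_unlab):
    """Hypothetical sketch: estimate E[Y] from n labels and K black-box predictors.

    y_lab:       (n,)   observed labels
    preds_lab:   (n, K) predictions on the labeled sample
    preds_unlab: (N, K) predictions on the unlabeled sample
    Returns (estimate, weights).
    """
    y_lab = np.asarray(y_lab, dtype=float)
    n = len(y_lab)
    preds_lab = np.asarray(preds_lab, dtype=float).reshape(n, -1)
    k = preds_lab.shape[1]
    preds_unlab = np.asarray(preds_unlab, dtype=float).reshape(-1, k)
    big_n = len(preds_unlab)

    # Mean-zero augmentation: both terms estimate the same prediction mean,
    # so subtracting any multiple of `diff` leaves the estimator unbiased.
    diff = preds_lab.mean(axis=0) - preds_unlab.mean(axis=0)

    # Plug-in variance-minimizing weights:
    #   Var(diff) = (1/n + 1/N) * Sigma,  Cov(ybar, diff) = c / n
    #   => w = N / (n + N) * Sigma^{-1} c
    sigma = np.atleast_2d(np.cov(np.vstack([preds_lab, preds_unlab]), rowvar=False))
    centered = preds_lab - preds_lab.mean(axis=0)
    c = centered.T @ (y_lab - y_lab.mean()) / (n - 1)
    w = (big_n / (n + big_n)) * (np.linalg.pinv(sigma) @ c)

    return y_lab.mean() - w @ diff, w
```

Setting the weights to zero recovers the labeled-only mean, which is the sense in which this kind of estimator is safe; a predictor that tracks the outcome closely receives a weight near one and shifts the estimate toward the more precise unlabeled-sample mean.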

Paper Structure

This paper contains 24 sections, 6 theorems, 79 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Among the family of estimators $\widehat{\bm\theta}(\mathcal{W})$ solving the general estimating equation (eq:generalEE), the optimal tuning parameter, $\mathcal{W}^{\text{opt}}$, is the one that minimizes the mean squared error of $\widehat{\bm\theta}(\mathcal{W})$ about ${\bm\theta}^*$ … where $\mathcal{S}(\mathbf{x},\widehat{\mathbf{y}};{\bm\theta}) := (\mathbf{s}(\mathbf{x},\widehat{y}_1; \ldots$

Figures (4)

  • Figure 1: Protocol for computing the proposed SADA estimator $\widehat{{\bm\theta}}^\text{\normalfont sada}$.
  • Figure 2: Relative efficiency of different methods compared to the naive method as prediction quality varies. Stars and triangles on the line indicate standard deviations over 1000 replications; scatter points represent those from individual replications.
  • Figure 3: Comparison of standard deviations of different methods leveraging various prediction strategies. The estimand is the regression coefficient of politeness score on indicative modal features.
  • Figure 4: Top-1 accuracy of various methods on the test images across increasing numbers of labeled data.

Theorems & Definitions (15)

  • Proposition 1
  • Proposition 2
  • Remark 1: Interpretation
  • Remark 2: Comparison with PPI++
  • Theorem 1: Safety
  • Theorem 2: Adaptivity
  • Theorem 3: Excess risk
  • proof
  • proof: Proof of the oracle case
  • proof: Proof of Proposition \ref{prop:mean_eif}
  • ...and 5 more