SADA: Safe and Adaptive Aggregation of Multiple Black-Box Predictions in Semi-Supervised Learning
Jiawei Shan, Zhifeng Chen, Yiming Dong, Yazhen Wang, Jiwei Zhao
TL;DR
SADA is a data-driven framework for safely and adaptively aggregating multiple black-box predictions in semi-supervised learning (SSL). It formulates an unbiased augmented estimating equation and selects an optimal weight vector, so the resulting estimator performs no worse than the labeled-only baseline regardless of prediction quality, and it attains a faster convergence rate or the semiparametric efficiency bound when one of the predictions perfectly fits the ground truth. The approach is supported by theoretical guarantees, simulations, and two real-data analyses (politeness regression and ImageNet classification), and is implemented in the R package sada. This allows predictions of uncertain quality, produced by diverse models and systems, to be used robustly for both inference and prediction tasks.
Abstract
Semi-supervised learning (SSL) arises in practice when labeled data are scarce or expensive to obtain, while large quantities of unlabeled data are readily available. With the growing adoption of machine learning techniques, it has become increasingly feasible to generate multiple predicted labels using a variety of models and algorithms, including deep learning, large language models, and generative AI. In this paper, we propose a novel approach that safely and adaptively aggregates multiple black-box predictions of uncertain quality for both inference and prediction tasks. Our method provides two key guarantees: (i) it never performs worse than using the labeled data alone, regardless of the quality of the predictions; and (ii) if any one of the predictions (without knowing which one) perfectly fits the ground truth, the algorithm adaptively exploits this to achieve either a faster convergence rate or the semiparametric efficiency bound. We demonstrate the effectiveness of the proposed algorithm through small-scale simulations and two real-data analyses with distinct scientific goals. A user-friendly R package, sada, is provided to facilitate practical implementation.
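The two guarantees can be illustrated on the simplest target, a population mean. The sketch below is not the paper's estimating equation and not the API of the sada package. It is a generic control-variate construction in Python, written under the assumption that labeled and unlabeled covariates come from the same distribution, and the function name aggregate_mean and all variable names are hypothetical. The labeled-only mean is corrected by a weighted difference between the predictors' labeled and unlabeled averages, and the weights are the plug-in variance minimizer. Zero weights recover the labeled-only mean, which is why a variance-minimizing choice cannot do worse asymptotically.

```python
import numpy as np


def aggregate_mean(y_lab, f_lab, f_unl):
    """Control-variate estimate of E[Y] from K black-box predictors.

    y_lab : (n,)   labels on the labeled sample
    f_lab : (n, K) predictions on the labeled sample
    f_unl : (N, K) predictions on the unlabeled sample
    Returns (estimate, weights). Illustrative only; not the sada package API.
    """
    y_lab = np.asarray(y_lab, dtype=float)
    f_lab = np.asarray(f_lab, dtype=float)
    f_unl = np.asarray(f_unl, dtype=float)
    n, N = len(y_lab), len(f_unl)

    # Sample covariances estimated on the labeled data.
    fc = f_lab - f_lab.mean(axis=0)
    yc = y_lab - y_lab.mean()
    sigma = fc.T @ fc / (n - 1)    # Cov(f, f), shape (K, K)
    cov_fy = fc.T @ yc / (n - 1)   # Cov(f, Y), shape (K,)

    # Minimizer of Var(ybar - w^T (fbar_lab - fbar_unl)) over w.
    # pinv tolerates collinear predictors.
    w = (N / (n + N)) * (np.linalg.pinv(sigma) @ cov_fy)

    # The correction has mean zero for any w, so the estimate stays unbiased
    # however poor the predictors are.
    correction = f_lab.mean(axis=0) - f_unl.mean(axis=0)
    return y_lab.mean() - w @ correction, w


# Toy data: one useless predictor, one perfect predictor, one noisy predictor.
rng = np.random.default_rng(0)
n, N = 200, 5000
y = 2.0 + rng.normal(size=n + N)
f = np.column_stack([
    rng.normal(size=n + N),            # pure noise
    y,                                 # perfectly fits the ground truth
    0.5 * y + rng.normal(size=n + N),  # informative but noisy
])
estimate, weights = aggregate_mean(y[:n], f[:n], f[n:])
print("labeled-only mean:", y[:n].mean())
print("aggregated estimate:", estimate)
print("weights:", weights)
```

In this toy setup the weight vector concentrates on the perfect predictor without being told which column it is, and the estimate coincides with the mean of that predictor over all n + N units, so its variance shrinks from Var(Y)/n to Var(Y)/(n + N). With only uninformative predictors the weights shrink toward zero and the estimate falls back to the labeled-only mean.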
