Adversarial Robustness in One-Stage Learning-to-Defer
Yannis Montreuil, Letian Yu, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
TL;DR
This work introduces the first framework for adversarial robustness in one-stage Learning-to-Defer (L2D), addressing both classification and regression. It formalizes untargeted and targeted attacks that perturb inputs to influence both predictions and deferral decisions, and it proposes cost-sensitive, tractable surrogate losses with theoretical guarantees including Bayes-consistency, $\mathcal{H}$-consistency for classification, and $(\mathcal{R},\mathcal{F})$-consistency for regression. The authors also present smooth adversarial surrogates and regularized ERM algorithms (RERM-C for classification and RERM-R for regression) to achieve robust end-to-end L2D while preserving clean accuracy. Empirical results on CIFAR-10, DermaMNIST, Communities and Crime, and COIL 2000 demonstrate substantial robustness improvements under both untargeted and targeted attacks, with lower adversarial deferral losses and competitive or better clean performance. Overall, the paper provides both theoretical foundations and practical methods for deploying robust one-stage L2D systems in safety-critical settings where both predictions and deferrals must be reliable under adversarial perturbations.
Abstract
Learning-to-Defer (L2D) enables hybrid decision-making by routing inputs either to a predictor or to external experts. While promising, L2D is highly vulnerable to adversarial perturbations, which can not only flip predictions but also manipulate deferral decisions. Prior robustness analyses focus solely on two-stage settings, leaving open the end-to-end (one-stage) case where predictor and allocation are trained jointly. We introduce the first framework for adversarial robustness in one-stage L2D, covering both classification and regression. Our approach formalizes attacks, proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including $\mathcal{H}$-consistency, $(\mathcal{R}, \mathcal{F})$-consistency, and Bayes-consistency. Experiments on benchmark datasets confirm that our methods improve robustness against untargeted and targeted attacks while preserving clean performance.
