Distributionally Robust Optimization with Adversarial Data Contamination
Shuyao Li, Ilias Diakonikolas, Jelena Diakonikolas
TL;DR
This work addresses learning under distributional shifts when the training data may be adversarially contaminated, focusing on Wasserstein-1 DRO with convex Lipschitz losses and an $\epsilon$-fraction of corrupted samples. It introduces a principled modeling framework that separates pre-decision data contamination from post-decision shifts, and provides an efficient primal-dual algorithm with provable guarantees, achieving an estimation error of $O(\|w^*\|_2 \sigma \zeta \sqrt{\epsilon})$ under a bounded-covariance assumption. The paper also shows how the DRO objective can be reformulated as a regularized risk, $\min_w \mathbb{E}_{P_0}[\ell(w; x, y)] + \rho\zeta\|w\|_s$, and shows that the algorithm can be driven by inexact quantities obtained from robust mean estimation. A concrete application to SVMs yields an $O(\epsilon^{1/4})$ error under anti-concentration. Overall, the results establish the first rigorous, computationally efficient guarantees for learning under simultaneous data contamination and distributional shifts, offering a principled path toward robust deployment in real-world settings.
Abstract
Distributionally Robust Optimization (DRO) provides a framework for decision-making under distributional uncertainty, yet its effectiveness can be compromised by outliers in the training data. This paper introduces a principled approach to simultaneously address both challenges. We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions, where an $\epsilon$-fraction of the training data is adversarially corrupted. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts, alongside an efficient algorithm inspired by robust statistics to solve the resulting optimization problem. We prove that our method achieves an estimation error of $O(\sqrt{\epsilon})$ for the true DRO objective value using only the contaminated data under the bounded covariance assumption. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
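To make the regularized-risk form $\min_w \mathbb{E}_{P_0}[\ell(w; x, y)] + \rho\zeta\|w\|_s$ from the TL;DR concrete, the following is a minimal sketch for a linear SVM with hinge loss. It is not the paper's primal-dual algorithm. It is plain subgradient descent in which the average per-sample subgradient is replaced by a coordinate-wise trimmed mean, a simple stand-in for a robust mean estimator. Several choices are assumptions made only for illustration: $s = 2$, reading $\rho$ as the Wasserstein radius and $\zeta$ as the Lipschitz constant of the loss (1 for hinge), and the hypothetical helper names `trimmed_mean`, `robust_regularized_svm`, and `make_contaminated_data`.

```python
import numpy as np


def trimmed_mean(vectors, eps):
    """Coordinate-wise trimmed mean: per coordinate, drop the k smallest and
    k largest values (k = ceil(eps * n)) and average the rest."""
    n = vectors.shape[0]
    k = min(int(np.ceil(eps * n)), (n - 1) // 2)  # always keep >= 1 value
    ordered = np.sort(vectors, axis=0)
    return ordered[k:n - k].mean(axis=0)


def robust_regularized_svm(X, y, rho, zeta=1.0, eps=0.1, steps=200, lr=0.1):
    """Subgradient descent on  mean hinge loss + rho * zeta * ||w||_2,
    with the mean subgradient replaced by a trimmed mean (illustrative only)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(steps):
        margins = y * (X @ w)
        active = (margins < 1.0).astype(float)
        # Per-sample subgradient of max(0, 1 - y * <w, x>) with respect to w.
        per_sample = -(active * y)[:, None] * X
        g = trimmed_mean(per_sample, eps)
        norm = np.linalg.norm(w)
        if norm > 0:
            # Subgradient of the norm regularizer (0 is a valid choice at w = 0).
            g = g + rho * zeta * w / norm
        w = w - lr / np.sqrt(t + 1) * g
    return w


def make_contaminated_data(seed=0, n=200, eps=0.1):
    """Synthetic 2-D data: clean points near y * (2, 0); the first eps * n
    points are replaced by far-away outliers that pull w the wrong way."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * np.array([2.0, 0.0]) + 0.5 * rng.standard_normal((n, 2))
    n_bad = int(eps * n)
    X[:n_bad] = -1000.0 * y[:n_bad, None] * np.array([1.0, 0.0])
    clean = np.ones(n, dtype=bool)
    clean[:n_bad] = False
    return X, y, clean


if __name__ == "__main__":
    X, y, clean = make_contaminated_data()
    w = robust_regularized_svm(X, y, rho=0.1, eps=0.1)
    acc = np.mean(np.sign(X[clean] @ w) == y[clean])
    print("w =", w, "accuracy on clean points =", acc)
```

In this toy setup the outliers contribute very large per-sample subgradients, which the trimming step discards before averaging. The sketch carries none of the paper's guarantees; the stated error bounds rely on the estimator and analysis developed there.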
