Holistic Robust Data-Driven Decisions
Amine Bennouna, Bart Van Parys, Ryan Lucas
TL;DR
This work tackles the challenge of making data-driven decisions with reliable out-of-sample performance by identifying three distinct sources of overfitting: statistical error, noise, and misspecification. It introduces Holistic Robust (HR) optimization, a distributionally robust framework that combines a Lévy-Prokhorov–type ambiguity set, which guards against noise and corruption, with a KL-divergence ball, which guards against statistical error. The resulting formulation delivers uniform out-of-sample guarantees and admits tractable reformulations. The authors show that HR subsumes several existing DRO formulations as special cases and provide finite and dual reformulations along with efficient computational strategies. They validate HR on neural-network training with healthcare data and on a portfolio selection problem with real stock data, reporting improved robustness, calibration, and favorable return–risk tradeoffs under distribution shift. The modular HR approach offers practical guidance for hyperparameter tuning and extends to various data-corruption scenarios, making it a versatile tool for robust decision-making in high-stakes settings.
Abstract
The design of data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. In practice, overfitting typically cannot be attributed to a single cause but arises from several factors simultaneously. We consider here three overfitting sources: (i) statistical error, which results from working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and (iii) data misspecification, in which a small fraction of all data may be wholly corrupted. Although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all overfitting sources simultaneously. We design a novel data-driven formulation that guarantees such holistic protection and is computationally viable. Our distributionally robust optimization formulation can be interpreted as a combination of a Kullback-Leibler and a Lévy-Prokhorov robust optimization formulation. In the context of classification and regression problems, we show that several popular regularized and robust formulations naturally reduce to particular cases of our proposed formulation. Finally, we apply the proposed HR formulation to two real-life applications and study it alongside several benchmarks: (1) training neural networks on healthcare data, where we analyze various robustness and generalization properties in the presence of noise, labeling errors, and scarce data, and (2) a portfolio selection problem with real stock data, where we analyze the risk/return tradeoff under the severe distribution shift that arises naturally in this application.
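To make the Kullback-Leibler ingredient concrete, the sketch below evaluates a worst-case expected loss over a KL ball around the empirical distribution, using the standard dual form inf over lambda > 0 of lambda*r + lambda*log E[exp(loss/lambda)]. This is only an illustration under stated assumptions: it uses the common convention KL(Q || P_n) <= r, which may differ from the direction used in the HR formulation, it omits the Lévy-Prokhorov component entirely, and the function name `kl_worst_case_mean` and the grid search over the dual variable are hypothetical choices, not the authors' method.

```python
import numpy as np


def kl_worst_case_mean(losses, radius, num_grid=2000):
    """Approximate sup of E_Q[loss] over {Q : KL(Q || P_n) <= radius}.

    P_n is the empirical distribution of `losses`. The value is obtained by
    minimising the dual objective over a log-spaced grid of the dual variable,
    so it is an upper bound on the true worst case (weak duality).
    """
    losses = np.asarray(losses, dtype=float)
    if radius <= 0:
        # A ball of radius zero contains only the empirical distribution.
        return float(losses.mean())
    top = losses.max()
    spread = top - losses.min()
    if spread == 0:
        # All losses are equal, so every distribution gives the same mean.
        return float(top)
    # Candidate dual variables, scaled by the spread of the losses.
    lams = spread * np.logspace(-4, 4, num_grid)
    # Stable log-mean-exp: shift by the maximum so every exponent is <= 0.
    z = (losses[None, :] - top) / lams[:, None]
    log_mean_exp = np.log(np.mean(np.exp(z), axis=1))
    dual = lams * radius + top + lams * log_mean_exp
    return float(dual.min())


if __name__ == "__main__":
    sample_losses = [0.0, 1.0, 2.0, 3.0]
    for r in (0.0, 0.1, 0.5, 10.0):
        print(r, kl_worst_case_mean(sample_losses, r))
```

As the radius grows, the value moves from the empirical mean toward the largest observed loss, which is the sense in which a KL ball trades in-sample optimism for protection against statistical error.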
