Holistic Robust Data-Driven Decisions

Amine Bennouna, Bart Van Parys, Ryan Lucas

TL;DR

This work tackles the challenge of making data-driven decisions with reliable out-of-sample performance by identifying three distinct sources of overfitting: statistical error, noise, and misspecification. It introduces Holistic Robust (HR) optimization, a distributionally robust framework that combines a Lévy-Prokhorov–type ambiguity set, which guards against noise and corruption, with a KL-divergence ball, which guards against statistical error, delivering uniform out-of-sample guarantees and tractable reformulations. The authors show that several popular regularized and robust formulations arise as special cases of HR, and they provide a finite primal reduction, a dual formulation, and efficient computational strategies. They validate HR on neural-network training with healthcare data and on a portfolio problem with real stock data, demonstrating improved robustness, calibration, and favorable return–risk tradeoffs across distribution shifts. The modular HR approach offers practical guidance for hyperparameter tuning and extends to various data-corruption scenarios, making it a versatile tool for robust decision-making in high-stakes settings.

Abstract

The design of data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. Practical overfitting typically cannot be attributed to a single cause but arises from several factors simultaneously. We consider here three overfitting sources: (i) statistical error as a result of working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and (iii) data misspecification, in which a small fraction of all data may be wholly corrupted. Although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all overfitting sources simultaneously. We design a novel data-driven formulation that guarantees such holistic protection and is computationally viable. Our distributionally robust optimization formulation can be interpreted as a novel combination of Kullback-Leibler and Lévy-Prokhorov robust optimization formulations. In the context of classification and regression problems, we show that several popular regularized and robust formulations naturally reduce to a particular case of our proposed novel formulation. Finally, we apply the proposed HR formulation to two real-life applications and study it alongside several benchmarks: (1) training neural networks on healthcare data, where we analyze various robustness and generalization properties in the presence of noise, labeling errors, and scarce data, and (2) a portfolio selection problem with real stock data, where we analyze the risk/return tradeoff under the natural severe distribution shift of the application.

Paper Structure (67 sections, 21 theorems, 139 equations, 22 figures, 4 tables)

Key Result

theorem 1

Assume the noise $\tn$ realizes in the set $\cN$ and the misspecification probability is less than $\alpha$, i.e., $\Prob(\tilde{c}=1) < \alpha$. For a given distribution $\Pb^c\in\cP$ of the corrupted sample $\txi^c\in \Sigma$, the set of possible out-of-sample distributions $\Pb$ of $\txi$ is ...

Figures (22)

  • Figure 1: Two distributions $\mu$ and $\nu$ which are equivalent up to a small distributional shift $\epsilon>0$ are very dissimilar in terms of their entropic divergence, i.e., $\KL(\mu, \nu)=\infty$.
  • Figure 2: Two distributions $\mu$ and $\nu$ satisfy the constraint $\LP_{\cN}(\mu,\nu)\leq \alpha$ if there is a coupling $\gamma$ so that at most $\alpha$ mass of the coupling is assigned to the outside of the cylinder strip associated with the noise set $\cN$.
  • Figure 3: Illustration of the LP-DRO expression of Theorem \ref{thm: LP-DRO expression for empirical}. The circles represent the observed corrupted samples ordered by increasing inflated loss $\loss^\cN(x,\xi_{[1]}) \leq \ldots \leq \loss^\cN(x,\xi_{[T]})$, with $p = \lceil \alpha T \rceil$. The filled part represents the $1-\alpha$ fraction with highest inflated loss $\loss^\cN$. The LP-DRO predictor corresponds to the total loss of this $1-\alpha$ fraction of the samples with highest loss $\loss^\cN$ (which is $(1-\alpha)\CVaR$) plus $\alpha$ times the worst-case scenario. The adversary replaces the $\alpha$ fraction of the samples with lowest inflated loss $\loss^\cN$ with the worst-case loss.
  • Figure 4: When learning with corrupted data, data points are first sampled independently from $\Pb$. Sampling from $\Pb$ itself results in statistical errors, which however remain bounded by $r$ with high probability $1-\exp(-rT+o(T))$, as measured by the KL divergence between $\Pb$ and $\hat{\mathbb{Q}}$. Subsequently, an adversary corrupts these samples with noise in $\cN$ and misspecification at frequency at most $\alpha$. Theorem \ref{thm: LP-DRO robustness} guarantees that the distance between $\hat{\mathbb{Q}}$ and $\Pemp{T}$ is at most $\alpha$ as measured by the pseudometric $\LP_{\cN}$.
  • Figure 5: In the data generation process with oblivious adversaries, an adversary can corrupt the data generation distribution away from $\Pb$ towards any $\Qb$ within distance $\alpha$ as measured by our pseudometric $\LP_\cN$. Sampling from $\Qb$ itself results in statistical errors, which however remain bounded by $r$ with high probability $1-\exp(-rT)$, as measured by the KL divergence between $\Pemp{T}$ and $\Qb$.
  • ...and 17 more figures
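
The adversary described in the Figure 3 caption can be illustrated with a small numerical sketch. The code below is not the authors' implementation: it assumes the inflated losses $\loss^\cN(x,\xi_t)$ have already been computed for a fixed decision $x$, that a finite worst-case loss value is supplied by the caller, and that when $\alpha T$ is not an integer the adversary removes only part of the mass of the sample at the boundary. The names `lp_dro_predictor`, `inflated_losses`, and `worst_case_loss` are hypothetical.

```python
def lp_dro_predictor(inflated_losses, alpha, worst_case_loss):
    """Sketch of the LP-DRO predictor as described in the Figure 3 caption.

    Each of the T samples carries mass 1/T. The adversary moves a total
    mass of alpha away from the samples with the lowest inflated loss and
    places it on the worst-case loss. Splitting the boundary sample when
    alpha * T is fractional is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    losses = sorted(inflated_losses)  # ascending: lowest losses come first
    T = len(losses)
    if T == 0:
        raise ValueError("at least one sample is required")

    remaining = alpha                  # mass the adversary can still move
    total = alpha * worst_case_loss    # moved mass is charged the worst case
    for loss in losses:
        mass = 1.0 / T
        moved = min(mass, remaining)   # take mass from the lowest losses first
        remaining -= moved
        total += (mass - moved) * loss  # mass left on the sample keeps its loss
    return total
```

For example, with inflated losses `[1.0, 2.0, 3.0, 4.0]`, `alpha = 0.25`, and a worst-case loss of `10.0`, the sample with loss 1.0 is replaced by the worst case, giving `0.25 * 10 + 0.25 * (2 + 3 + 4) = 4.75`. Setting `alpha = 0` recovers the plain empirical average of the inflated losses.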

Theorems & Definitions (37)

  • theorem 1: Corruption set characterization
  • corollary 1: Robustness against noise and misspecification
  • theorem 2: Efficient robustness
  • theorem 3
  • remark 1: Robust Statistics and Outliers
  • corollary 2
  • theorem 4: Uniform Robustness Out-of-sample Guarantee
  • theorem 5: Efficient Robustness
  • theorem 6: Finite Primal Reduction
  • theorem 7: Dual Formulation
  • ...and 27 more