Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization

Nicola Bariletto, Nhat Ho

TL;DR

This work proposes a novel robust criterion that combines insights from Bayesian nonparametric (i.e., Dirichlet process) theory with a recent decision-theoretic model of smooth ambiguity-averse preferences, and shows that the smoothness of the criterion makes it directly amenable to standard gradient-based numerical optimization.

Abstract

Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet process) theory and a recent decision-theoretic model of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, including Ridge and LASSO regression. Then, we establish favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to a variety of tasks based on simulated and real datasets.

Paper Structure (51 sections, 13 theorems, 67 equations, 4 figures, 3 tables, 4 algorithms)

Key Result

Proposition 2.1

Let $h(\theta, (y, x)) = (y-\theta^\top x)^2$. Then, denoting $\lambda_{\alpha, n}:=\alpha/n$, the following equivalences hold:

  1. If $p_0 = \mathcal{N}(0, I)$, then $\hat{\theta}$ solving eq:ambiguity_neutral_criterion implies that it solves
  2. If $V=\textnormal{diag}(\vert\theta_1\vert^{-1}, \dots, \vert\theta_{d-1}\vert^{-1})$ and $p_0 = \mathcal{N}(0, V)$, then $\hat{\theta}$ solving eq:ambig
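The first statement of Proposition 2.1 is cut off in this excerpt, but the abstract's reference to Ridge regression and the definition $\lambda_{\alpha, n}:=\alpha/n$ suggest how it can be checked numerically. The sketch below rests on two assumptions that are not stated in the text above: that the ambiguity-neutral criterion is the Dirichlet-process mixture $\frac{n}{\alpha+n}\cdot\frac{1}{n}\sum_i h(\theta,\xi_i) + \frac{\alpha}{\alpha+n}\,\mathbb{E}_{p_0}[h(\theta,\xi)]$, and that the target problem is Ridge regression with penalty $\lambda_{\alpha,n}\Vert\theta\Vert_2^2$. The data, dimensions, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 50, 5, 10.0              # hypothetical sample size, dimension, DP concentration
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

lam = alpha / n                        # lambda_{alpha,n} := alpha / n, as in the proposition


def neutral_criterion(theta):
    """ASSUMED form of the ambiguity-neutral criterion (not given in this excerpt):
    n/(alpha+n) * empirical risk + alpha/(alpha+n) * E_{p0}[h(theta, (y, x))]."""
    empirical = np.mean((y - X @ theta) ** 2)
    # Under (y, x) ~ N(0, I): E[(y - theta^T x)^2] = 1 + ||theta||_2^2
    prior = 1.0 + theta @ theta
    return (n * empirical + alpha * prior) / (alpha + n)


# Closed-form Ridge minimizer of (1/n) * ||y - X theta||^2 + lam * ||theta||_2^2
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Analytic gradient of the assumed criterion, evaluated at the Ridge solution
grad = (2.0 / (alpha + n)) * (-(X.T @ (y - X @ theta_ridge)) + alpha * theta_ridge)

print("max |gradient| at the Ridge solution:", np.max(np.abs(grad)))
```

Under these assumptions the gradient vanishes at the Ridge solution: multiplying the criterion by $(\alpha+n)/n$ and dropping the constant $\alpha/n$ leaves the empirical squared loss plus $(\alpha/n)\Vert\theta\Vert_2^2$.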

Figures (4)

  • Figure 1: Graphical display of smooth ambiguity aversion at work. Although $\theta_1$ and $\theta_2$ yield the same loss $\mathcal{R}^*$ in $Q$-expectation, the ambiguity averse criterion favors the less variable decision $\theta_1$. Graphically, this is because the orange line connecting $\phi(\mathcal{R}_{p_1}(\theta_1))$ to $\phi(\mathcal{R}_{p_2}(\theta_1))$ lies (point-wise) below the line connecting $\phi(\mathcal{R}_{p_1}(\theta_2))$ to $\phi(\mathcal{R}_{p_2}(\theta_2))$.
  • Figure 2: Simulation results for the high-dimensional sparse linear regression experiment. Bars report the mean and standard deviation (across 200 sample simulations) of the test RMSE, $L_2$ distance of estimated coefficient vector $\hat{\theta}$ from the data-generating one, and the $L_2$ norm of $\hat{\theta}$. Results are shown for the ambiguity-averse, ambiguity-neutral, and OLS procedures. Note: The left (blue) axis refers to mean values, the right (orange) axis to standard deviation values.
  • Figure 3: Simulation results from the experiment on Gaussian mean estimation with outliers. Bars report the mean and standard deviation (across 100 sample simulations) of the test mean negative log-likelihood and the absolute value distance of the estimated parameter from 0 (the data-generating value). Results are shown for the ambiguity-averse, ambiguity-neutral, and MLE procedures. Note: The left (blue) axis refers to mean values, the right (orange) axis to standard deviation values.
  • Figure 4: Simulation results for the high-dimensional sparse logistic regression experiment. Bars report the mean and standard deviation (across 200 sample simulations) of the test average loss, $L_2$ distance of estimated coefficient vector $\hat{\theta}$ from the data-generating one, and the $L_2$ norm of $\hat{\theta}$. Results are shown for the ambiguity-averse, $L_2$-regularized, and un-regularized procedures. Note: The left (blue) axis refers to mean values, the right (orange) axis to standard deviation values.
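The mechanism described in the Figure 1 caption can be illustrated with a small numerical sketch. It assumes a criterion of the form $\mathbb{E}_Q[\phi(\mathcal{R}_p(\theta))]$ with $\phi$ convex and increasing; the exponential $\phi$, the two-model weights, and the risk values below are hypothetical choices made only for illustration.

```python
import numpy as np


def phi(t, beta=1.0):
    # Convex, increasing transformation; the exponential form is an assumption
    return beta * np.exp(t / beta) - beta


q = np.array([0.5, 0.5])                 # hypothetical Q-weights on two models p1, p2
risks_theta1 = np.array([0.9, 1.1])      # R_{p1}(theta1), R_{p2}(theta1): low variability
risks_theta2 = np.array([0.2, 1.8])      # R_{p1}(theta2), R_{p2}(theta2): high variability

# Both decisions have the same risk in Q-expectation
mean1, mean2 = q @ risks_theta1, q @ risks_theta2

# The ambiguity-averse values differ because phi is convex (Jensen's inequality)
value1, value2 = q @ phi(risks_theta1), q @ phi(risks_theta2)

print("Q-expected risks:", mean1, mean2)
print("ambiguity-averse values:", value1, value2)
```

Because both decisions share the same $Q$-expected risk, only the convexity of $\phi$ separates them, and the less variable decision $\theta_1$ receives the smaller (preferred) value.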

Theorems & Definitions (21)

  • Proposition 2.1
  • Proposition 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Remark 3.5
  • Theorem 3.6
  • Remark 4.1
  • Remark 4.2
  • Lemma 4.3
  • ...and 11 more