It's All in the Mix: Wasserstein Classification and Regression with Mixed Features

Reza Belbasi, Aras Selvi, Wolfram Wiesemann

TL;DR

The paper tackles distributionally robust learning with mixed continuous and discrete features under Wasserstein ambiguity, addressing the exponential-scaling challenge of traditional formulations. It develops a cutting plane algorithm with a polynomial-time separation oracle to solve the resulting exponential-size convex program, establishes that mixed-feature problems are NP-hard in general yet tractable in important cases, and proves an equivalence to bounded continuous-feature formulations. Empirically, the method outperforms nominal and unbounded continuous-feature approaches, especially when discrete features are prevalent, and offers substantial computational advantages over monolithic reformulations. The work thus provides both theoretical foundations and practical tools for robust prediction in operations management and related domains with mixed feature types, along with open-source resources for further experimentation.

Abstract

Problem definition: A key challenge in supervised learning is data scarcity, which can cause prediction models to overfit to the training data and perform poorly out of sample. A contemporary approach to combat overfitting is offered by distributionally robust problem formulations that consider all data-generating distributions close to the empirical distribution derived from historical samples, where 'closeness' is determined by the Wasserstein distance. While such formulations show significant promise in prediction tasks where all input features are continuous, they scale exponentially when discrete features are present.

Methodology/results: We demonstrate that distributionally robust mixed-feature classification and regression problems can indeed be solved in polynomial time. Our proof relies on classical ellipsoid method-based solution schemes that do not scale well in practice. To overcome this limitation, we develop a practically efficient (yet, in the worst case, exponential time) cutting plane-based algorithm that admits a polynomial time separation oracle, despite the presence of exponentially many constraints. We compare our method against alternative techniques both theoretically and empirically on standard benchmark instances.

Managerial implications: Data-driven operations management problems often involve prediction models with discrete features. We develop and analyze distributionally robust prediction models that faithfully account for the presence of discrete features, and we demonstrate that our models can significantly outperform existing methods that are agnostic to the presence of discrete features, both theoretically and on standard benchmark instances.
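To give a feel for how a separation oracle can run in polynomial time even though the constraints are indexed by exponentially many discrete feature vectors, the following toy sketch may help. It is not the paper's oracle: the linear score, the Hamming-distance budget, and the function names `most_violated_binary` and `brute_force` are all assumptions made for illustration. The sketch maximizes a linear score over all binary vectors within a given number of flips of a reference vector by sorting per-coordinate gains, and a brute-force enumeration is included only as a check.

```python
from itertools import product


def most_violated_binary(w, z_ref, max_flips):
    """Maximize sum(w[k] * z[k]) over binary z within Hamming distance
    max_flips of z_ref, using a sort instead of enumerating 2^K vectors.
    (Hypothetical toy oracle, not the paper's separation routine.)"""
    # Flipping coordinate k changes the score by +w[k] (0 -> 1) or -w[k] (1 -> 0).
    gains = [(w[k] * (1 - 2 * z_ref[k]), k) for k in range(len(w))]
    gains.sort(reverse=True)
    z = list(z_ref)
    for gain, k in gains[:max_flips]:
        if gain <= 0:
            break  # remaining flips cannot increase the score
        z[k] = 1 - z[k]
    value = sum(wk * zk for wk, zk in zip(w, z))
    return z, value


def brute_force(w, z_ref, max_flips):
    """Exponential-time reference: enumerate every binary vector."""
    best = None
    for z in product((0, 1), repeat=len(w)):
        if sum(a != b for a, b in zip(z, z_ref)) <= max_flips:
            value = sum(wk * zk for wk, zk in zip(w, z))
            if best is None or value > best:
                best = value
    return best


w = [2.0, -1.0, 0.5, -3.0]
z_ref = [0, 1, 0, 1]
z_star, v_star = most_violated_binary(w, z_ref, max_flips=2)
print(z_star, v_star, brute_force(w, z_ref, max_flips=2))
```

The greedy step costs O(K log K) for K binary features, whereas the brute-force check visits all 2^K vectors. In a cutting plane scheme, a routine of this kind would be called repeatedly to find the next constraint to add.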

Paper Structure

This paper contains 13 sections, 21 theorems, 72 equations, 2 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

The following problems admit equivalent reformulations in the form of eq-most-vio-generic:

Figures (2)

  • Figure 1: (Worst-case) distributions for our three formulations. The bars represent the mean probabilities, and the whiskers indicate the corresponding standard deviations, computed over 10,000 statistically independent simulations.
  • Figure 2: Mean out-of-sample losses for various classification (left) and regression (right) tasks when the number of discrete features that are treated as such varies. All losses are scaled to $[0,1]$ and shifted so that the curves do not overlap.

Theorems & Definitions (43)

  • Definition 1: Wasserstein Distance
  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Theorem 3: Complexity of the Unified Problem Representation (eq-most-vio-generic)
  • Theorem 4: Absence of Regularizers
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Corollary 1
  • ...and 33 more