Mixed-feature Logistic Regression Robust to Distribution Shifts

Qingshi Sun, Nathan Justin, Andres Gomez, Phebe Vayanos

TL;DR

This work addresses logistic regression under distribution shifts by formulating a Wasserstein distributionally robust optimization (DRO) model for mixed features that allows the likelihood of a shift to differ across features. It develops two scalable solution approaches: a cutting-plane method with a dynamic-programming constraint-violation oracle, and a graph-based reformulation that maps constraint evaluation to longest-path problems on per-data-point DAGs. Together these yield large runtime gains (up to ~408x) and improved predictive reliability. Calibration under shifts is achieved via a principled parameter-tuning scheme that ties perturbation costs to domain knowledge, including explicit expressions for $\gamma_j$, $\delta_\ell$, and $\epsilon$ in terms of shift probabilities and a likelihood-ratio threshold $\theta$. Empirically, on 13 UCI datasets the method reduces average calibration error by up to 36% and increases average AUC by up to 18%, with larger improvements in worst-case metrics, demonstrating practical applicability for high-stakes domains with heterogeneous distribution shifts.

Abstract

Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains, due to their simplicity and interpretability properties. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem that seeks the model that will perform best against adversarial realizations of the data distribution drawn from a suitably constructed Wasserstein ambiguity set. Our model and solution approach differ from prior work in that we can capture settings where the likelihood of distribution shifts can vary across features, significantly broadening the applicability of our model relative to the state-of-the-art. We propose a graph-based solution approach that can be integrated into off-the-shelf optimization solvers. We evaluate the performance of our model and algorithms on numerous publicly available datasets. Our solution achieves a 408x speed-up relative to the state-of-the-art. Additionally, compared to the state-of-the-art, our model reduces average calibration error by up to 36.19% and worst-case calibration error by up to 41.70%, while increasing the average area under the ROC curve (AUC) by up to 18.02% and worst-case AUC by up to 48.37%.
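For orientation, the generic shape of a Wasserstein distributionally robust learning problem and its standard dual can be written as follows. This is a sketch in assumed notation, not the paper's exact formulation: $\hat{\mathbb{P}}_N$ (empirical distribution of the $N$ training points $\bm{\xi}^i$), $\ell_{\bm{\beta}}$ (log-loss of the model with coefficients $\bm{\beta}$), $W$ (Wasserstein distance), and $d$ (ground metric) are symbols introduced here for illustration. Only $\epsilon$, $\lambda$, and $\bm{r}$ appear in the text above and below.

```latex
% Primal: best model against the worst distribution in a Wasserstein ball of radius \epsilon
\min_{\bm{\beta}} \; \sup_{\mathbb{Q} \,:\, W(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \epsilon}
  \; \mathbb{E}_{\mathbb{Q}}\big[ \ell_{\bm{\beta}}(\bm{\xi}) \big]

% Standard dual of the inner supremum (strong duality under mild conditions)
\min_{\bm{\beta},\, \lambda \ge 0,\, \bm{r}} \;\; \lambda \epsilon + \frac{1}{N} \sum_{i=1}^{N} r_i
\quad \text{s.t.} \quad
r_i \;\ge\; \sup_{\bm{\xi}} \Big\{ \ell_{\bm{\beta}}(\bm{\xi}) - \lambda\, d(\bm{\xi}, \bm{\xi}^i) \Big\},
\qquad i = 1, \dots, N
```

In this generic form, the modelling choices live in the ground metric $d$: weighting each feature's contribution to $d$ differently is one way to express that some features are more likely to shift than others.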

Paper Structure

This paper contains 20 sections, 4 theorems, 77 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Without shifts in labels, the distributionally robust logistic regression problem (eq:original formulation) can be reformulated as [equation not shown], where $\lambda$ and $\bm{r}$ are dual variables arising from dualizing the inner maximization problem in (eq:original formulation). The constraints with log-loss functions can be converted into exponential cone format, resulting in a convex problem that can be solved with o…
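The conversion of a log-loss constraint into exponential cone format follows a standard conic-modelling identity, sketched below. The symbols $u$, $t$, $a$, $b$ are placeholders introduced here and are not taken from the paper; $u$ stands for whatever affine expression appears inside the log-loss, and $t$ for the affine upper bound.

```latex
% Exponential cone
\mathcal{K}_{\exp} = \operatorname{cl}\big\{ (x_1, x_2, x_3) \,:\, x_1 \ge x_2\, e^{x_3 / x_2},\; x_2 > 0 \big\}

% Softplus (log-loss) constraint and its conic representation
\log\!\big(1 + e^{u}\big) \le t
\;\;\Longleftrightarrow\;\;
e^{u - t} + e^{-t} \le 1
\;\;\Longleftrightarrow\;\;
\exists\, a, b :\;\;
a + b \le 1,\quad
(a,\, 1,\, u - t) \in \mathcal{K}_{\exp},\quad
(b,\, 1,\, -t) \in \mathcal{K}_{\exp}
```

Each log-loss constraint therefore costs two exponential-cone memberships and one linear inequality, which is the form accepted by conic solvers that support the exponential cone.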

Figures (5)

  • Figure 1: A graph example with only two one-hot-encoded categorical features: $\bm{z} = (\bm{z}_1, \bm{z}_2)$, where $\bm{z}_1$ has three possible realizations, $[1,0], [0,1], [0,0]$, and $\bm{z}_2$ has two possible realizations, $[1], [0]$. The data point with index $i$ is $\bm{z}^i = (\bm{z}_1^i, \bm{z}_2^i) = ([0,0], [0])$. The weight parameters are $\bm{\delta} = (\delta_{1}, \delta_{2}) = (1,1)$.
  • Figure 2: Performance improvement compared to lasso logistic regression in terms of calibration error and AUC across different levels of robustness $\theta$ under expected perturbations. Blue boxes: proposed models with weight parameters rounded to integer; orange boxes: with weight parameters rounded to one decimal.
  • Figure 3: Performance improvement compared to distributionally robust logistic regression with all weight parameters set to 1 in terms of calibration error and AUC across different levels of robustness $\theta$ under expected perturbations. Blue boxes: proposed models with weight parameters rounded to integer; orange boxes: with weight parameters rounded to one decimal.
  • Figure 4: Overview of performance improvement compared to lasso-regularized logistic regression in terms of calibration error and AUC across different levels of robustness $\theta$ under unexpected perturbations. The blue boxes represent our proposed models with calibrated parameters rounded to integer. The orange boxes represent our proposed models with calibrated parameters rounded to one decimal place.
  • Figure 5: Overview of performance improvement compared to distributionally robust logistic regression with all weight parameters set to 1 in terms of calibration error and AUC across different levels of robustness $\theta$ under unexpected perturbations. The blue boxes represent our proposed models with calibrated parameters rounded to integer. The orange boxes represent our proposed models with calibrated parameters rounded to one decimal place.
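The setting of Figure 1 can be used to illustrate how a worst-case categorical perturbation reduces to a longest-path computation on a layered DAG. The following is a minimal sketch under stated assumptions, not necessarily the graph construction of Theorem 2: it assumes the distance between two data points is the $\bm{\delta}$-weighted count of categorical features that differ, that nodes are pairs (feature layer, cumulative perturbation cost), and that `beta_blocks` holds hypothetical coefficients already signed so that a larger linear score means a larger log-loss. The function names are illustrative.

```python
import math
from itertools import product


def realizations(width):
    """One-hot realizations with a dropped reference level: each unit vector plus all-zeros."""
    units = [tuple(1 if j == k else 0 for j in range(width)) for k in range(width)]
    return units + [tuple([0] * width)]


def longest_paths(beta_blocks, z_obs, delta):
    """Longest path to every node of the final layer of the layered DAG.

    Returns {total cost d: max linear score over realizations at weighted distance d}.
    Layers are processed in order, which is a topological order of the DAG.
    """
    best = {0: 0.0}  # source node: no feature perturbed yet, zero score
    for beta, z0, dl in zip(beta_blocks, z_obs, delta):
        nxt = {}
        for cost, val in best.items():
            for z in realizations(len(beta)):
                c = cost + (0 if z == tuple(z0) else dl)  # pay delta_l if feature l changes
                v = val + sum(b * x for b, x in zip(beta, z))
                if v > nxt.get(c, float("-inf")):
                    nxt[c] = v
        best = nxt
    return best


def brute_force(beta_blocks, z_obs, delta):
    """Same quantity by enumerating every joint realization (exponential; for checking only)."""
    best = {}
    for combo in product(*(realizations(len(b)) for b in beta_blocks)):
        c = sum(dl for z, z0, dl in zip(combo, z_obs, delta) if z != tuple(z0))
        v = sum(b * x for beta, z in zip(beta_blocks, combo) for b, x in zip(beta, z))
        best[c] = max(best.get(c, float("-inf")), v)
    return best


def worst_case(best, lam):
    """sup over realizations of softplus(score) - lam * distance; softplus is increasing,
    so only the best score at each distance matters."""
    return max(math.log1p(math.exp(v)) - lam * d for d, v in best.items())


beta_blocks = [(0.5, -1.0), (2.0,)]  # hypothetical signed coefficients for z_1 and z_2
z_obs = [(0, 0), (0,)]               # the data point z^i from Figure 1
delta = (1, 1)                       # weight parameters from Figure 1
result = longest_paths(beta_blocks, z_obs, delta)
violation_lhs = worst_case(result, lam=1.0)
```

The dynamic program visits each (layer, cost) node once per realization of the next feature, so its work grows with the sum of the category counts rather than their product, which is what makes per-data-point constraint evaluation tractable as the number of categorical features grows.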

Theorems & Definitions (9)

  • Definition 1: Wasserstein Distance
  • Definition 2: Weighted Distance Metric
  • Theorem 1: Convex Formulation
  • Lemma 1
  • Theorem 2: Graph-based Formulation
  • Lemma 2
  • proof
  • proof
  • proof