Knowledge-Guided Wasserstein Distributionally Robust Optimization

Zitao Wang, Ziyuan Wang, Molei Liu, Nian Si

TL;DR

Knowledge-Guided Wasserstein Distributionally Robust Optimization (KG-WDRO) addresses the conservativeness of standard WDRO in transfer learning by guiding transport with prior knowledge through knowledge-informed directions. The authors establish a theoretical equivalence between KG-WDRO and shrinkage-based transfer estimators under collinear similarity, and provide tractable dual reformulations that support strong-/weak-transfer and multi-source extensions, including Mahalanobis-type metrics. The framework spans linear regression and binary classification, offering strong results under various loss functions and cost relaxations, and demonstrates superior performance in small-sample, multi-site, and high-dimensional settings. This work unifies several transfer-learning strategies within a distributionally robust perspective and provides practical mechanisms to adjust scaling and cross-source differences, with potential for data-driven hyperparameter tuning.

Abstract

Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that adaptively incorporates multiple sources of external knowledge to overcome the conservativeness of vanilla WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.
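The idea of shrinking toward the source knowledge rather than toward zero can be illustrated with a small sketch. It is not the paper's estimator: it assumes a single prior coefficient vector `theta`, measures collinear similarity as the Euclidean distance from `beta` to the line spanned by `theta`, and uses a squared (ridge-type) penalty so that a closed form exists. The function names `collinear_penalty` and `guided_ridge`, the penalty weight `lam`, and the simulated data are all hypothetical.

```python
import numpy as np


def collinear_penalty(beta, theta):
    """Return (min over kappa of ||beta - kappa * theta||_2, minimizing kappa)."""
    # The minimizing kappa is the least-squares projection coefficient of beta on theta.
    kappa = float(theta @ beta) / float(theta @ theta)
    return float(np.linalg.norm(beta - kappa * theta)), kappa


def guided_ridge(X, y, theta, lam):
    """Ridge-type fit that penalizes only the part of beta orthogonal to theta.

    Solves min_beta ||y - X beta||^2 + lam * ||P_perp beta||^2, where P_perp
    projects onto the orthogonal complement of theta (an illustrative assumption).
    """
    d = X.shape[1]
    P_perp = np.eye(d) - np.outer(theta, theta) / float(theta @ theta)
    return np.linalg.solve(X.T @ X + lam * P_perp, X.T @ y)


# Small simulated target sample whose true coefficients are a rescaling of theta.
rng = np.random.default_rng(0)
theta = np.array([2.0, 1.0])
beta_true = 0.5 * theta
X = rng.normal(size=(10, 2))
y = X @ beta_true + 0.5 * rng.normal(size=10)

# Plain ridge shrinks every direction toward zero; the guided fit leaves
# the direction of theta unpenalized.
beta_ridge = np.linalg.solve(X.T @ X + 50.0 * np.eye(2), X.T @ y)
beta_guided = guided_ridge(X, y, theta, 50.0)

print("ridge :", beta_ridge)
print("guided:", beta_guided)
print("distance of guided fit to span(theta):", collinear_penalty(beta_guided, theta)[0])
```

Because the penalty vanishes along `theta`, the guided fit is free to rescale the prior coefficients, which loosely mirrors the abstract's point about adjusting for scaling differences between source and target.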

Paper Structure

This paper contains 36 sections, 15 theorems, 103 equations, 2 figures, 4 tables.

Key Result

Proposition 1

Let $c:\mathbb{R}^{d+1}\times \mathbb{R}^{d+1}\to [0,\infty]$ be a lower semi-continuous cost function satisfying $c((x,y),(u,v)) = 0$ whenever $(x,y) = (u,v)$. Then the distributionally robust regression problem is equivalent to a dual problem expressed through a function $\phi_\gamma(x_i,y_i;\beta)$.
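
For orientation, strong duality results of this kind in the Wasserstein DRO literature usually take the following form. This is a sketch under assumed notation, not the paper's exact statement: the loss $\ell$, the radius $\delta$, the empirical measure $\mathbb{P}_n$, and the optimal-transport discrepancy $\mathcal{D}_c$ induced by $c$ are symbols introduced here for illustration.

$$
\sup_{P:\ \mathcal{D}_c(P,\mathbb{P}_n)\le \delta} \mathbb{E}_P\big[\ell(X,Y;\beta)\big]
= \inf_{\gamma \ge 0}\Big\{\gamma\delta + \frac{1}{n}\sum_{i=1}^{n}\phi_\gamma(x_i,y_i;\beta)\Big\},
$$

$$
\phi_\gamma(x_i,y_i;\beta) = \sup_{(u,v)\in\mathbb{R}^{d+1}}\Big\{\ell(u,v;\beta) - \gamma\, c\big((x_i,y_i),(u,v)\big)\Big\}.
$$

Read this way, $\gamma$ is the multiplier on the transport budget, and $\phi_\gamma$ is the worst-case loss around each observed point after paying for the perturbation.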

Figures (2)

  • Figure 1: The two-dimensional contour plots of the regularization term in Theorem \ref{thm:linear_l2} and Theorem \ref{thm:weak_l2} with $\lambda$ ranging from $+\infty$ to $2$ to $0.1$. The prior knowledge parameter is taken as $\theta = (2,1)^{\sf T}$. The area between the black contours constitutes a feasibility set of the regularization term when written in its equivalent constraint form. The feasibility set shrinks in the direction of $\theta$, to a circle of radius $K$ when $\lambda \to 0$ from above.
  • Figure 2: Out-of-sample performance plot of the proposed KG-WDRO method for high-dimensional regression tasks, compared against benchmark methods. The plot shows performance variations as $\rho$, representing the correlation between true and prior coefficient pairs, increases. Results are displayed for four specific settings across three experimental groups.

Theorems & Definitions (31)

  • Example 1
  • Proposition 1: Strong Duality
  • Remark 1
  • Theorem 1: Linear Regression with Strong Transferring
  • Remark 2
  • Theorem 2: Linear Regression with Weak Transferring
  • Theorem 3: Binary Classification with Strong Transferring
  • Corollary 1: Theorem \ref{thm:linear_l2}
  • Lemma A1
  • proof
  • ...and 21 more