Neural Local Wasserstein Regression
Inga Girshfeld, Xiaohui Chen
TL;DR
The paper tackles distribution-on-distribution regression, where both predictors and responses are probability measures, a setting in which global optimal-transport maps and tangent-space linearizations can fail in high dimensions. It introduces Neural Local Wasserstein Regression, a nonparametric framework that learns covariate-dependent, locally defined transport maps in the 2-Wasserstein space by combining kernel weights with neural parameterizations of transport operators. The approach uses DeepSets for distributional inputs and a U-Net for image-like data, trained with Sinkhorn-approximated $W_2$ losses and a data-driven bandwidth rule, which allows scalable local models around reference measures. Experiments on Gaussian, Gaussian-mixture, and MNIST data show that local transport captures nonlinear distributional relationships that global methods miss, with practical implications for high-dimensional distributional regression and robust geometry-aware learning.
Abstract
We study distribution-on-distribution regression, where both predictors and responses are probability measures. Existing approaches typically rely on a global optimal transport map or a tangent-space linearization, which can be restrictive in approximation capacity and can distort geometry when the underlying domain is multivariate. In this paper, we propose Neural Local Wasserstein Regression, a flexible nonparametric framework that models regression through locally defined transport maps in Wasserstein space. Our method builds on the analogy with classical kernel regression: kernel weights based on the 2-Wasserstein distance localize estimators around reference measures, while neural networks parameterize transport operators that adapt flexibly to complex data geometries. This localized perspective broadens the class of admissible transformations and avoids the limitations of global map assumptions and linearization structures. We develop a practical training procedure using DeepSets-style architectures and Sinkhorn-approximated losses, combined with a greedy reference selection strategy for scalability. Through synthetic experiments on Gaussian and mixture models, as well as distributional prediction tasks on MNIST, we demonstrate that our approach effectively captures nonlinear and high-dimensional distributional relationships that elude existing methods.
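To make the kernel-regression analogy concrete, the following is a minimal sketch of how 2-Wasserstein kernel weights could localize around a reference measure. It is not the authors' implementation: the Gaussian kernel, the fixed `bandwidth`, the function names `w2_squared_1d` and `wasserstein_kernel_weights`, and the restriction to one-dimensional empirical measures with equal sample sizes (where $W_2$ has an exact sorted-sample formula, so no Sinkhorn approximation is needed) are all assumptions made for illustration.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two 1-D empirical measures.

    Assumes both measures have the same number of equally weighted atoms,
    in which case the optimal coupling matches sorted samples.
    """
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    if x_sorted.shape != y_sorted.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean((x_sorted - y_sorted) ** 2))

def wasserstein_kernel_weights(reference, predictors, bandwidth):
    """Normalized Gaussian-kernel weights based on W2 to a reference measure.

    The Gaussian kernel and the scalar bandwidth are illustrative choices,
    not taken from the paper.
    """
    d2 = np.array([w2_squared_1d(reference, mu) for mu in predictors])
    raw = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return raw / raw.sum()

# Hypothetical data: a reference measure and three predictor measures
# whose means drift progressively further from the reference.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=200)
predictors = [rng.normal(m, 1.0, size=200) for m in (0.1, 1.0, 3.0)]
weights = wasserstein_kernel_weights(reference, predictors, bandwidth=1.0)
```

Predictor measures close to the reference in $W_2$ receive most of the weight, so a local transport model fitted with these weights is driven mainly by nearby training pairs, as in classical kernel regression.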
