Fast and Large-Scale Unbalanced Optimal Transport via its Semi-Dual and Adaptive Gradient Methods

Ferdinand Genans

TL;DR

This work tackles scalable optimization for Unbalanced OT (UOT) by analyzing the entropic semi-dual. It shows that the local geometry near the optimizer has a condition number scaling as $\mathcal{O}(1/\varepsilon)$, independent of the problem size $n$, which enables adaptive first-order methods. The authors develop PASGD for semi-discrete problems, with a convergence rate of $\mathcal{O}(n/(\varepsilon T))$, and design ANAG for the discrete full-batch setting, with a near-optimal local complexity of $\mathcal{O}(n^2\sqrt{1/\varepsilon}\ln(1/\delta))$. They provide a rigorous treatment of global and local curvature, generalized self-concordance, and data-dependent smoothness, along with numerical experiments on color transfer and semi-discrete tasks. Overall, the paper delivers scalable, theory-backed solvers for large-scale UOT while highlighting the practical benefits of using a $\chi^2$ target divergence rather than KL in the semi-dual.

Abstract

Unbalanced Optimal Transport (UOT) has emerged as a robust relaxation of standard Optimal Transport, particularly effective for handling outliers and mass variations. However, scalable algorithms for UOT, specifically those based on Stochastic Gradient Descent (SGD), remain largely underexplored. In this work, we address this gap by analyzing the semi-dual formulation of Entropic UOT and demonstrating its suitability for adaptive gradient methods. While the semi-dual is a standard tool for large-scale balanced OT, its geometry in the unbalanced setting appears ill-conditioned under standard analysis. Specifically, worst-case bounds on the marginal penalties using the $\chi^2$ divergence suggest a condition number scaling with $n/\varepsilon$, implying poor scalability. In contrast, we show that the local condition number actually scales as $\mathcal{O}(1/\varepsilon)$, effectively removing the ill-conditioned dependence on $n$. Exploiting this property, we prove that SGD methods adapt to this local curvature, achieving a convergence rate of $\mathcal{O}(n/(\varepsilon T))$ in the stochastic and online regimes, making them suitable for large-scale and semi-discrete applications. Finally, for the full-batch discrete setting, we derive a nearly tight upper bound on local smoothness depending solely on the gradient. Using it to adapt step sizes, we propose a modified Adaptive Nesterov Accelerated Gradient (ANAG) method on the semi-dual functional and prove that it achieves a local complexity of $\mathcal{O}(n^2\sqrt{1/\varepsilon}\ln(1/\delta))$.
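As a rough illustration of the gap between the two scalings quoted above, the following Python sketch evaluates the worst-case $n/\varepsilon$ and local $1/\varepsilon$ quantities with all hidden constants set to 1. The function name `condition_number_scalings` and the chosen values of $n$ and $\varepsilon$ are illustrative assumptions, not taken from the paper; only the scalings themselves come from the abstract.

```python
import math


def condition_number_scalings(n: int, eps: float) -> dict:
    """Evaluate the scalings quoted in the abstract, with constants set to 1 (assumption)."""
    worst_case = n / eps   # worst-case bound: scales as n / eps
    local = 1.0 / eps      # local bound near the optimizer: scales as 1 / eps
    return {
        "worst_case": worst_case,
        "local": local,
        "ratio": worst_case / local,      # equals n: the factor removed locally
        "sqrt_local": math.sqrt(local),   # sqrt(1/eps) factor in the ANAG complexity
    }


if __name__ == "__main__":
    eps = 0.01  # illustrative regularization level
    for n in (100, 2000, 100_000):
        print(n, condition_number_scalings(n, eps))
```

Under these assumptions the ratio between the two quantities is exactly $n$, which is the dependence the local analysis removes.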

Paper Structure (44 sections, 11 theorems, 179 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Background on Unbalanced Optimal Transport
Dual optimizers and induced coupling.
Entropic UOT Semi-Dual: Derivation and Properties
Semi-Dual Formulation and Gradient
The First Order Condition Keystone
Global and Local Curvature
The Necessity of the Target $\chi^2$ Divergence.
Adaptive Gradient Descent on the Semi-Dual
Large-Scale and Semi-Discrete Settings
Unbiased Gradient Estimator.
Complexity.
The Online Regime.
Adaptivity and Convergence.
Numerical Experiments: Semi-Discrete Setting
...and 29 more sections

Key Result

Proposition 2

(Proof in the appendix on the semi-dual.) With $\alpha := \frac{\varepsilon}{\varepsilon + \rho_1}$, the semi-dual objective $\mathcal{J}: \mathbb{R}^n \to \mathbb{R}$ is given in closed form, together with its gradient with respect to the $k$-th component.

Figures (8)

Figure 1: PASGD vs. SGD. Convergence of the objective gap on a semi-discrete UOT problem ($n=2000$), with $\eta_t = C \frac{n}{\rho_2}(t+ 1/\varepsilon)^{-2/3}$, $C \in \{0.05, 1, 10\}$. PASGD confirms the $\mathcal{O}(1/T)$ rate and shows superior performance compared to SGD.
Figure 2: Effect of $\varepsilon$. Convergence profiles for varying entropic regularization levels. The objective gap (Left) reflects a practical dependence on $\varepsilon$, whereas the parameter error $\|\bar{\mathbf{g}}_t - \mathbf{g}^\star\|^2$ (Right) demonstrates higher robustness to the regularization parameter.
Figure 3: High-Resolution Color Transfer ($1024 \times 1024$). We transport the source color distribution (a) to the target geometry (b). The parameter $\rho$ controls the fidelity of the mass transfer. At $\rho=0.1$ (c), the relaxation allows for partial matching. At $\rho=10$ (e), the penalty enforces nearly balanced transport.
Figure 4: Smoothness Adaptive NAG (with safeguard restarts)
Figure 5: Scale Invariance and Adaptive Acceleration. Convergence on random measures with varying support sizes $n$ ($\varepsilon=0.01, \rho=10$). We compare ANAG against Adaptive GD and Conservative NAG (fixed step $1/L_{\text{global}}$). The overlap of ANAG curves confirms the dimension-independent local complexity, while the superiority of adaptive schemes highlights the benefit of local step sizes.
...and 3 more figures
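
The step-size schedule quoted in the Figure 1 caption, $\eta_t = C \frac{n}{\rho_2}(t + 1/\varepsilon)^{-2/3}$, can be evaluated as in the minimal Python sketch below. The function name `step_size` and the numerical values of $\rho_2$ and $\varepsilon$ are hypothetical choices; only the formula, $n = 2000$, and the set $C \in \{0.05, 1, 10\}$ come from the caption.

```python
def step_size(t: int, n: int, rho2: float, eps: float, C: float = 1.0) -> float:
    """Evaluate eta_t = C * (n / rho2) * (t + 1/eps)^(-2/3) from the Figure 1 caption."""
    return C * (n / rho2) * (t + 1.0 / eps) ** (-2.0 / 3.0)


if __name__ == "__main__":
    n = 2000                # support size reported for Figure 1
    rho2, eps = 1.0, 0.01   # assumed values, not given in the caption
    for C in (0.05, 1.0, 10.0):
        etas = [step_size(t, n, rho2, eps, C) for t in (0, 10, 100, 1000)]
        print(C, etas)
```

The $1/\varepsilon$ offset keeps the initial step finite: at $t = 0$ the step equals $C \frac{n}{\rho_2}\varepsilon^{2/3}$, after which it decays as $t^{-2/3}$.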

Theorems & Definitions (11)

Proposition 2: Semi-Dual Objective and Gradient
Theorem 3: Smoothness Bound via Gradient Transport
Proposition 4: First-Order Optimality
Lemma 5: Uniform Gradient Bound and Smoothness
Corollary 6: Local Conditioning
Proposition 7: Generalized self-concordance
Proposition 8: Variance Bound of Mini-Batch Gradient
Theorem 9: Convergence of PASGD
Proposition 11: Asymmetric $L_0$--$L_1$ smoothness along a gradient step
Theorem 12: Adaptive NAG Convergence Rate
...and 1 more
