Robust First and Second-Order Differentiation for Regularized Optimal Transport

Xingjie Li; Fei Lu; Molei Tao; Felix X.-F. Ye

TL;DR

Analytical derivation and spectral analysis identify and resolve the numerical instability caused by the singularity and ill-posedness of a key linear system, enabling the implementation of stochastic gradient descent (SGD)-Newton methods.

Abstract

Applications such as unbalanced and fully shuffled regression can be approached by optimizing regularized optimal transport (OT) distances, such as the entropic OT and Sinkhorn distances. A common approach for this optimization is to use a first-order optimizer, which requires the gradient of the OT distance. For faster convergence, one might also resort to a second-order optimizer, which additionally requires the Hessian. The computations of these derivatives are crucial for efficient and accurate optimization. However, they present significant challenges in terms of memory consumption and numerical instability, especially for large datasets and small regularization strengths. We circumvent these issues by analytically computing the gradients for OT distances and the Hessian for the entropic OT distance, which was not previously used due to intricate tensor-wise calculations and the complex dependency on parameters within the bi-level loss function. Through analytical derivation and spectral analysis, we identify and resolve the numerical instability caused by the singularity and ill-posedness of a key linear system. Consequently, we achieve scalable and stable computation of the Hessian, enabling the implementation of the stochastic gradient descent (SGD)-Newton methods. Tests on shuffled regression examples demonstrate that the second stage of the SGD-Newton method converges orders of magnitude faster than the gradient descent-only method while achieving significantly more accurate parameter estimations.
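To make the first-order part of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a squared Euclidean cost between two point clouds, a KL-regularized entropic OT objective, and log-domain Sinkhorn updates. The names `sinkhorn_eot` and `eot_grad_x` are hypothetical. The gradient with respect to the source points uses the envelope argument: the derivative of the entropic OT value with respect to the cost matrix is the optimal coupling.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_eot(X, Y, mu, nu, eps, n_iter=500):
    """Entropic OT value and coupling (assumed KL-regularized form)."""
    # Assumed cost: C_ij = ||x_i - y_j||^2.
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    f = np.zeros(len(mu))
    g = np.zeros(len(nu))
    log_mu, log_nu = np.log(mu), np.log(nu)
    for _ in range(n_iter):
        # Alternating dual updates that enforce row, then column marginals.
        f = -eps * logsumexp((g[None, :] - C) / eps + log_nu[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + log_mu[:, None], axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / eps) * mu[:, None] * nu[None, :]
    # At convergence the dual value equals <f, mu> + <g, nu>.
    return f @ mu + g @ nu, P

def eot_grad_x(X, Y, P):
    # Envelope-style gradient: sum_j P_ij * d/dx_i ||x_i - y_j||^2.
    return 2.0 * (P.sum(axis=1)[:, None] * X - P @ Y)
```

Because the coupling is held fixed when differentiating, no backpropagation through the Sinkhorn iterations is needed for this gradient. This shortcut applies to the entropic OT value only; the Sinkhorn distance, which drops the entropy term, would need additional terms.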

Paper Structure (26 sections, 13 theorems, 73 equations, 4 figures, 1 table, 2 algorithms)

Introduction
Outline
Optimal Transport Loss and Sinkhorn Algorithm
Optimal Transport Loss Functions
Wasserstein-2 Metric
Entropy-regularized OT (EOT) Distance
Sinkhorn Distance
Sinkhorn Algorithm
Differentiation of Loss Functions
Previous methods computing first- and second-order derivatives
Analytical computation of the gradients
Computation of the Hessian
Solve the linear systems with truncated SVD
Spectral analysis of the H-matrix
General positive coupling matrices
...and 11 more sections

Key Result

Theorem 1

Suppose $\phi(\mathbf{x}, \mathbf{z})$ is a continuous function of two arguments, $\phi: \mathbb{R}^n\times Z\rightarrow \mathbb{R}$, where $Z\subset \mathbb{R}^m$ is a compact set. Define $f(\mathbf{x})=\max_{\mathbf{z}\in Z}\phi(\mathbf{x}, \mathbf{z})$ and the set of maximizing points ...
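The statement above is cut off in this excerpt, so the following is only a sketch of the commonly used consequence of Danskin's theorem: when the maximizer $\mathbf{z}^*$ is unique and $\phi$ is continuously differentiable in $\mathbf{x}$, the gradient of $f$ equals $\nabla_{\mathbf{x}}\phi(\mathbf{x}, \mathbf{z}^*)$. The toy function $\phi(x, z) = xz - z^2/2$ on $Z=[-1,1]$, the grid search, and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy example (assumption): phi(x, z) = x*z - z^2/2 on the compact set Z = [-1, 1].
def phi(x, z):
    return x * z - 0.5 * z ** 2

def dphi_dx(x, z):
    # Partial derivative of phi with respect to x, holding z fixed.
    return z

# Dense grid standing in for the compact set Z.
Z_GRID = np.linspace(-1.0, 1.0, 200001)

def f_and_argmax(x):
    # f(x) = max_z phi(x, z), evaluated by brute force on the grid.
    vals = phi(x, Z_GRID)
    k = int(np.argmax(vals))
    return vals[k], Z_GRID[k]

def danskin_grad(x):
    # Differentiate phi in x at the maximizer, without differentiating the argmax.
    _, z_star = f_and_argmax(x)
    return dphi_dx(x, z_star)
```

For this toy problem the maximizer is the projection of $x$ onto $[-1,1]$, so `danskin_grad(x)` should agree with `np.clip(x, -1, 1)` up to the grid resolution.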

Figures (4)

Figure 1: Decay of the smallest positive eigenvalue $\lambda_{2N-1}$ in $N$ and $\epsilon$. Equally spaced points on the unit circle: (a) $\lambda_{2N-1}\approx \frac{\epsilon}{4N}$ when $N > \frac{2\pi}{\sqrt{\epsilon}}$; (b) $\lambda_{2N-1}\approx 4\pi^2 r_{N,\epsilon}$ when $\epsilon< \frac{4\pi^2}{N^2}$. Uniformly distributed point cloud in the unit square $[0,1]^2$: (c) $\lambda_{2N-1}=O(\frac{1}{N})$ when $N$ is large; (d) $\lambda_{2N-1}=O(e^{-\frac{1}{\epsilon}})$ when $\epsilon$ is small.
Figure 2: Comparison of runtime (in seconds) and marginal error for computing the Hessian $\frac{d^2\text{OT}_\epsilon(\mathbf{C}, \boldsymbol{\mu}, \boldsymbol{\mu})}{d\mathbf{Y}^2}$ among three approaches: unrolling, implicit differentiation, and the analytic expression with regularization (ours).
Figure 3: Shuffled Regression with Gaussian Mixtures.
Figure 4: 3D Point Cloud Registration.
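The eigenvalue behavior in Figure 1 and the truncated-SVD solve named in the outline can be explored with a small experiment. This excerpt does not define the $\mathbf{H}$-matrix, so the sketch below assumes the block form $\begin{pmatrix}\operatorname{diag}(\Pi\mathbf{1}) & \Pi\\ \Pi^\top & \operatorname{diag}(\Pi^\top\mathbf{1})\end{pmatrix}$ that commonly appears when differentiating Sinkhorn implicitly. Under that assumption the matrix is positive semidefinite with null vector $(\mathbf{1}, -\mathbf{1})$. All function names are hypothetical.

```python
import numpy as np

def circle_coupling(N, eps, n_iter=500):
    """Sinkhorn coupling of N equally spaced unit-circle points with themselves."""
    theta = 2 * np.pi * np.arange(N) / N
    X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    C = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)          # Gibbs kernel
    mu = np.full(N, 1.0 / N)      # uniform marginals
    u = np.ones(N)
    v = np.ones(N)
    for _ in range(n_iter):
        u = mu / (K @ v)
        v = mu / (K.T @ u)
    return u[:, None] * K * v[None, :]

def h_matrix(P):
    # Assumed block form; marginals are taken from P itself so the
    # null vector (1, -1) is exact up to rounding.
    return np.block([[np.diag(P.sum(axis=1)), P],
                     [P.T, np.diag(P.sum(axis=0))]])

def truncated_svd_solve(H, b, rel_tol=1e-10):
    # Pseudo-inverse solve that discards singular values below rel_tol * s_max,
    # which removes the singular direction instead of dividing by ~0.
    U, s, Vt = np.linalg.svd(H)
    keep = s > rel_tol * s[0]
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])
```

Calling `np.linalg.eigvalsh(h_matrix(circle_coupling(N, eps)))` for several values of `N` and `eps` gives the second-smallest eigenvalue, which can be compared against the trends reported in the caption. The relative threshold `rel_tol` is an illustrative choice, not a value from the paper.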

Theorems & Definitions (25)

Theorem 1: Danskin's theorem
Theorem 2
proof
Proposition 3
proof
Theorem 4
proof
Proposition 5
proof
Theorem 6: Simple zero eigenvalue for the $\mathbf{H}$-matrix
...and 15 more
