Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses

Eloi Tanguy

TL;DR

This work gives convergence guarantees for SGD when training neural networks with Sliced Wasserstein (SW) losses, which the authors present as the first of their kind, by framing the problem as non-smooth, non-convex optimization and using Clarke differential theory. It shows that interpolated fixed-step SGD trajectories approach the set of sub-gradient flows of the population loss $F$ as the step size decreases and that, under stricter assumptions, noised and projected SGD schemes have long-run limits close to generalized critical points of $F$, which supports the convergence observed in practice. The analysis relies on piecewise smooth network maps, Lipschitz regularity, and the path-differentiable structure of the SW loss; the results extend to $p$-SW orders under additional assumptions. The strongest results currently require discrete input measures; future work could cover non-discrete inputs and learned projections, and explore connections to SW flows and other OT-based losses.

Abstract

Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has seen uses for training generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed practically in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent works on convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap, and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of (sub)-gradient flow equations as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
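To make the object being optimized concrete, here is a minimal sketch of a Monte Carlo estimate of the squared SW distance between two discrete measures with the same number of points and uniform weights. It is not the paper's implementation: the function names, the number of directions, and the seeding are illustrative assumptions. It relies only on the standard fact that the 1D Wasserstein-2 distance between two uniform $n$-point measures is obtained by sorting the projected points.

```python
import numpy as np


def projected_w2_squared(X, Y, theta):
    """Squared W2 distance between the projections on direction theta of two
    uniform discrete measures supported on the rows of X and Y (same n)."""
    x_proj = np.sort(X @ theta)  # sorted 1D projections of the first cloud
    y_proj = np.sort(Y @ theta)  # sorted 1D projections of the second cloud
    # In 1D, the optimal coupling matches sorted points in order.
    return float(np.mean((x_proj - y_proj) ** 2))


def sample_directions(num_directions, dim, rng):
    """Draw directions uniformly on the unit sphere (normalised Gaussians)."""
    g = rng.standard_normal((num_directions, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def sliced_w2_squared(X, Y, num_directions=200, seed=0):
    """Monte Carlo estimate of SW_2^2: average of the projected squared W2
    distances over random directions (num_directions is an assumption)."""
    rng = np.random.default_rng(seed)
    thetas = sample_directions(num_directions, X.shape[1], rng)
    return float(np.mean([projected_w2_squared(X, Y, th) for th in thetas]))
```

In a training loop, `X` would be replaced by the network outputs on a batch and a fresh direction would be drawn at each SGD step; that loop is omitted here because it depends on the network and the autodiff framework.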

Paper Structure (32 sections, 11 theorems, 56 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Optimal Transport in Machine Learning
The Sliced Wasserstein Distance as an Alternative
Related Works
Contributions
Convergence of Interpolated SGD Under Practical Assumptions
Stronger Convergence Under Stricter Assumptions
Stochastic Gradient Descent with SW as Loss
Convergence of Interpolated SGD Trajectories on F
Convergence of Noised Projected SGD Schemes on F
Conclusion and Outlook
Table of Notations
Postponed Proofs
Proof of \ref{prop:SW_Gamma}
Background on Non-Smooth and Non-Convex Analysis
...and 17 more sections

Key Result

Proposition 1

The $(w_\theta(\cdot, Y))_{\theta \in \mathbb{S}^{d_y-1}}$ are uniformly locally Lipschitz (discrete_sliced_loss, Prop. 2.1). Let $K_w(r, X, Y) := 2n(r + \|X\|_{\infty, 2} + \|Y\|_{\infty, 2})$, for $X, Y \in \mathbb{R}^{n \times d_y}$ and $r > 0$. Then $w_\theta(\cdot, Y)$ is $K_w(r, X, Y)$-Lipschitz ...
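As a small illustration of the constant $K_w(r, X, Y)$ defined in the proposition, the sketch below evaluates it numerically. It assumes that $\|X\|_{\infty, 2}$ denotes the largest Euclidean norm among the rows of $X$; this reading is not confirmed by the excerpt above, and the helper names are hypothetical.

```python
import numpy as np


def norm_inf_2(X):
    """Assumed meaning of ||X||_{inf,2}: largest Euclidean norm of a row."""
    return float(np.max(np.linalg.norm(X, axis=1)))


def lipschitz_constant_kw(r, X, Y):
    """K_w(r, X, Y) = 2 n (r + ||X||_{inf,2} + ||Y||_{inf,2}), with n the
    number of rows (points) of X, following the formula in the proposition."""
    n = X.shape[0]
    return 2 * n * (r + norm_inf_2(X) + norm_inf_2(Y))
```

The constant grows linearly with the number of points $n$, the radius $r$, and the size of the two point clouds, so it is uniform in the direction $\theta$, which is the point of the proposition's title.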

Figures (1)

Figure 1: Training a NN on the $\mathrm{SW}$ loss with Stochastic Gradient Descent

Theorems & Definitions (20)

Proposition 1
Proposition 2
proof
Proposition 3
proof
Proposition 4
Theorem 1
Remark 1
Proposition 5
proof
...and 10 more
