Understanding Learning with Sliced-Wasserstein Requires Rethinking Informative Slices

Huy Tran, Yikun Bai, Ashkan Shahbazi, John R. Hershey, Soheil Kolouri

TL;DR

This work revisits the classical Sliced-Wasserstein distance (SWD). Rather than modifying the slicing distribution, it rescales the 1D Wasserstein distance so that all slices are equally informative, and shows that, under an appropriate data assumption and notion of slice informativeness, the per-slice rescaling reduces to a single global scaling factor on the SWD.

Abstract

The practical applications of Wasserstein distances (WDs) are constrained by their sample and computational complexities. Sliced-Wasserstein distances (SWDs) provide a workaround by projecting distributions onto one-dimensional subspaces, leveraging the more efficient, closed-form WDs for one-dimensional distributions. However, in high dimensions, most random projections become uninformative due to the concentration of measure phenomenon. Although several SWD variants have been proposed to focus on informative slices, they often introduce additional complexity and numerical instability, and compromise desirable theoretical (metric) properties of SWD. Amidst the growing literature on directly modifying the slicing distribution, an approach that often faces challenges, we revisit the classical Sliced-Wasserstein and propose instead to rescale the 1D Wasserstein to make all slices equally informative. Importantly, we show that with an appropriate data assumption and notion of slice informativeness, rescaling for all individual slices simplifies to a single global scaling factor on the SWD. This, in turn, translates to the standard learning rate search for gradient-based learning in common machine learning workflows. We perform extensive experiments across various machine learning tasks showing that the classical SWD, when properly configured, can often match or surpass the performance of more complex variants. We then answer the following question: "Is Sliced-Wasserstein all you need for common learning tasks?"
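
To make the slicing step in the abstract concrete, the following is a minimal NumPy sketch of a Monte Carlo estimate of the classical SWD between two equal-size samples. It is an illustration under stated assumptions, not the authors' implementation: the function name `sliced_wasserstein`, the default of 256 slices, and the uniform sample weights are all choices made here.

```python
import numpy as np

def sliced_wasserstein(x, y, n_slices=256, p=2, seed=0):
    """Monte Carlo estimate of SW_p^p between two equal-size samples.

    x, y: arrays of shape (n, d) with uniform weights (an assumption).
    Returns the average over slices of the 1D W_p^p, which for equal-size
    1D samples is computed by comparing sorted projections.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Directions drawn uniformly on the unit sphere S^{d-1}.
    theta = rng.standard_normal((n_slices, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project onto every direction, then sort each slice independently.
    px = np.sort(x @ theta.T, axis=0)
    py = np.sort(y @ theta.T, axis=0)
    # Mean over samples gives the 1D W_p^p; mean over slices gives SW_p^p.
    return np.mean(np.abs(px - py) ** p)
```

Multiplying the returned value by a constant before differentiation scales its gradient by the same constant, which for plain gradient descent has the same effect as rescaling the learning rate. This is the sense in which a single global scaling factor can be absorbed into a learning rate search.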

Paper Structure

This paper contains 35 sections, 16 theorems, 79 equations, 33 figures, and 4 tables.

Key Result

Proposition 4.7

Under Assumption asp:lowdsupp, let $\mu^k = U_\# \mu^d$ and $\nu^k = U_\# \nu^d$ be the pushforward measures in $\mathbb{R}^k$. Then, for any $\theta^d \in \mathbb{S}^{d-1}$, we have that: where $\theta^k=\frac{U^\top\theta^d}{\|U^\top\theta^d\|}$ with convention $\theta^k=0_k$ if $\|U^\top \theta^d\|=0$. Furthermore, we have that: Here, we adopt the convention $\frac{1}{0}\cdot 0=0$ in eq:sw_d_
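
The definition of $\theta^k$ above suggests how a slice in $\mathbb{R}^d$ relates to a slice in $\mathbb{R}^k$. The sketch below numerically checks one identity that follows from that definition, under assumptions made here rather than taken from the paper: $U \in \mathbb{R}^{d \times k}$ has orthonormal columns, both samples lie in the span of $U$, and the pushforward by $U$ means $x \mapsto U^\top x$. In that setting $\langle \theta^d, Uz \rangle = \|U^\top\theta^d\| \langle \theta^k, z \rangle$, so the 1D $W_p^p$ along $\theta^d$ equals $\|U^\top\theta^d\|^p$ times the 1D $W_p^p$ along $\theta^k$. This is not necessarily the exact statement of the proposition.

```python
import numpy as np

def w_pp_1d(a, b, p=2):
    """W_p^p between two equal-size 1D samples via sorted order statistics."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)) ** p)

rng = np.random.default_rng(0)
d, k, n, p = 50, 3, 200, 2  # illustrative sizes, chosen here

# Assumed setup: U has orthonormal columns spanning the k-dim support.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
z_mu = rng.standard_normal((n, k))          # sample of mu^k in R^k
z_nu = rng.standard_normal((n, k)) + 1.5    # sample of nu^k in R^k
x_mu, x_nu = z_mu @ U.T, z_nu @ U.T         # same samples embedded in R^d

# A random slice in R^d and its normalized image in R^k.
theta_d = rng.standard_normal(d)
theta_d /= np.linalg.norm(theta_d)
scale = np.linalg.norm(U.T @ theta_d)
theta_k = U.T @ theta_d / scale

lhs = w_pp_1d(x_mu @ theta_d, x_nu @ theta_d, p)
rhs = scale ** p * w_pp_1d(z_mu @ theta_k, z_nu @ theta_k, p)
print(lhs, rhs, scale)
```

The two printed distances agree up to floating-point error, and `scale` is below 1 because a random direction in $\mathbb{R}^d$ has only part of its norm inside the $k$-dimensional subspace.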

Figures (33)

  • Figure 1: Rescaling the 1D Wasserstein based on slice informativeness.
  • Figure 2: Minibatches of $d$-dimensional data, with $B$ source and $B$ target samples, reside in a linear subspace with dimensionality at most $k = \min\{2B-1, d\}$ when centered.
  • Figure 3: Left: Illustration of two $k$-dimensional Gaussian distributions embedded in $\mathbb{R}^d$ ($500$ samples each). Top row: Empirical ratios $\widehat{C}$ with varying $d$ for $k=2$ and $p=1,2$. Bottom row: Empirical ratios $\widehat{C}$ with varying $k$ for $d=1000$ and $p=1,2$.
  • Figure 4: The mean and standard deviation (shaded area) of $\widehat{ESSF}(L)$ for varying $d,k$ over $1000$ independent runs for $p=1$ (left) and $p=2$ (right).
  • Figure 5: Classic synthetic 2D datasets (shown) embedded in spaces of different target dimensions.
  • ...and 28 more figures
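
The dimensionality bound in the Figure 2 caption can be checked numerically. The sketch below is an illustration with assumed values ($B=8$, $d=1000$, Gaussian data) and assumes "centered" means subtracting the joint mean of all $2B$ points: the centered rows then sum to zero, so their span has dimension at most $2B-1$.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 8, 1000  # illustrative batch size and ambient dimension

# Stack B source and B target samples into one (2B, d) minibatch.
source = rng.standard_normal((B, d))
target = rng.standard_normal((B, d)) + 2.0
batch = np.vstack([source, target])

# Center with the joint mean; the rows now sum to the zero vector.
centered = batch - batch.mean(axis=0, keepdims=True)

rank = np.linalg.matrix_rank(centered)
k = min(2 * B - 1, d)
print(rank, k)
```

For generic data such as these Gaussian draws the bound is attained, so the numerical rank matches $k$.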

Theorems & Definitions (43)

  • Definition 4.2: Informative slices
  • Definition 4.3: ES-aligned informative slices
  • Remark 4.4
  • Example 4.5
  • Remark 4.6
  • Proposition 4.7
  • Remark 4.8: Implicit Downweighting
  • Theorem 4.9: Effective Subspace Scaling Factor
  • Proposition 4.10
  • Remark 4.11
  • ...and 33 more