Lightspeed Geometric Dataset Distance via Sliced Optimal Transport

Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho

TL;DR

This work targets the challenge of computing scalable, model- and embedding-agnostic distances between datasets. It introduces the Sliced Optimal Transport Dataset Distance (s-OTDD), built on Moment Transform Projection (MTP) to map a label distribution into a scalar via a one-dimensional feature projection and scaled moments, and then composes data-point projections to form one-dimensional representations. The distance is defined as the expected $W_p^p$ distance between projected distributions over random projections, is provably a metric under injectivity, and admits a Monte Carlo estimator with near-linear complexity in the number of data points and feature dimensions, independent of the number of classes. Empirically, s-OTDD closely tracks OTDD (Exact) while delivering substantial speedups and robust correlations with transfer-learning performance and augmentation effectiveness across image and text domains, making it well-suited for large-scale, distributed, or federated settings.

Abstract

We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
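
The closed-form one-dimensional optimal transport that the abstract relies on can be illustrated with a minimal sketch. It assumes two equal-size samples with uniform weights and uses plain random linear projections of feature vectors only, so it shows the sorting-based $W_p^p$ and the Monte Carlo average over projections, not the paper's label-aware data point projection. The function names `wasserstein_1d` and `sliced_distance` are hypothetical.

```python
import numpy as np

def wasserstein_1d(u, v, p=2):
    """W_p^p between two equal-size 1D empirical distributions.

    In one dimension the optimal coupling matches sorted samples,
    so the cost reduces to a sort followed by an average.
    """
    u_sorted = np.sort(u)
    v_sorted = np.sort(v)
    return float(np.mean(np.abs(u_sorted - v_sorted) ** p))

def sliced_distance(X, Y, num_projections=100, p=2, seed=0):
    """Monte Carlo average of 1D W_p^p over random unit directions.

    ASSUMPTION: features are projected with theta^T x only; labels are
    ignored here, unlike in s-OTDD.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Normalised Gaussian vectors are uniform on the unit sphere.
    thetas = rng.normal(size=(num_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    total = 0.0
    for theta in thetas:
        total += wasserstein_1d(X @ theta, Y @ theta, p)
    return total / num_projections
```

Each projection costs one matrix-vector product plus a sort, which is where the near-linear dependence on the number of data points and feature dimensions comes from.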

Paper Structure

This paper contains 17 sections, 4 theorems, 28 equations, 13 figures, 1 algorithm.

Key Result

Proposition 1

For $\mu,\nu \in \mathcal{P}(\mathbb{R}^d)$, an injective feature projection $\mathcal{FP}_\theta$, and a set $\Lambda \subset \mathbb{N}$, $\mathcal{MTP}_{\lambda,\theta}(\mu) = \mathcal{MTP}_{\lambda,\theta}(\nu)$ for all $\theta \in \mathbb{S}^{d-1}$ and $\lambda \in \Lambda$ implies $\mu=\nu$, where $m_{\theta,\mu,\lambda}$ is the $\lambda$-th moment of $\mathcal{FP}_\theta \sharp \mu$ (similar w
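
To see why the proposition quantifies over a set $\Lambda$ of moment orders rather than a single one, the following sketch compares two toy label distributions that share a first moment but differ in the second. It is only an illustration under stated assumptions: the feature projection is taken to be the linear map $\theta^\top x$, and the $1/\lambda!$ scaling is an assumed form of the "scaled moments" mentioned in the TL;DR, not a formula confirmed by this page. The names `projected_moment` and `scaled_moment` are hypothetical.

```python
import math
import numpy as np

def projected_moment(samples, theta, lam):
    """lam-th raw moment of the samples projected onto direction theta."""
    proj = samples @ theta  # ASSUMPTION: linear feature projection
    return float(np.mean(proj ** lam))

def scaled_moment(samples, theta, lam):
    """Raw moment divided by lam! (ASSUMPTION: illustrative scaling)."""
    return projected_moment(samples, theta, lam) / math.factorial(lam)

theta = np.array([1.0, 0.0])
# Two toy "labels", each represented by samples of its feature distribution.
label_a = np.array([[-1.0, 0.0], [1.0, 0.0]])
label_b = np.array([[-2.0, 0.0], [2.0, 0.0]])

# Order 1 cannot tell the labels apart; order 2 can.
first = (scaled_moment(label_a, theta, 1), scaled_moment(label_b, theta, 1))
second = (scaled_moment(label_a, theta, 2), scaled_moment(label_b, theta, 2))
print(first, second)
```

A single moment order maps both labels to the same scalar here, so distinguishing distributions requires agreement across all orders in $\Lambda$ and all directions $\theta$.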

Figures (13)

  • Figure 1: The figure shows the distance correlation with OTDD (Exact) of OTDD (Gaussian approximation), WTE, CHSW, and s-OTDD.
  • Figure 2: The figure shows the computational time of OTDD (Exact), OTDD (Gaussian approximation), WTE, CHSW (1,000, 5,000, 10,000 projections), and s-OTDD (1,000, 5,000, 10,000 projections) when varying the size of the two datasets.
  • Figure 3: The figure shows the Pearson correlations of s-OTDD with s-OTDD (50,000 projections) when varying the number of projections from 1,000 to 50,000 on the MNIST and CIFAR10 datasets.
  • Figure 4: The figure shows the correlations of OTDD (Exact) and s-OTDD (10,000 projections) with the performance gap when conducting transfer learning on *NIST datasets.
  • Figure 5: The figure shows the correlations of OTDD (Exact) and s-OTDD (10,000 projections) with the performance when conducting transfer learning on text datasets.
  • ...and 8 more figures

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Definition 3
  • Corollary 1
  • Definition 4
  • Proposition 2
  • Proposition 3