MAP-UOT: A Memory-Efficient Approach to Unbalanced Optimal Transport Implementation

Chengyu Sun, Jinyu Hu, Hong Jiang

TL;DR

MAP-UOT uses Roofline analysis to show that UOT is memory-bound, then removes the memory bottleneck by interleaving row and column rescalings, achieving near-peak performance on CPUs, GPUs, and HPC systems. The approach restructures each iteration into a single doubly nested loop, which reduces memory traffic and improves cache locality, with tailored CPU and GPU implementations. Empirical results show substantial speedups over POT and COFFEE across platforms (up to 3.5X on GPUs and 7.2X scalability on CPUs), along with a reduced memory footprint and improved throughput. The work demonstrates applicability to similar matrix-iteration problems and lays the groundwork for future extensions to sparse matrices and architecture-specific optimizations.
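The interleaving idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the update rule (scale columns by `(b / col_sum) ** fi`, then rows by `(a / row_sum) ** fi`), the exponent name `fi`, and both function names are assumptions chosen for illustration. The baseline sweeps the whole matrix several times per iteration, while the fused variant touches each row once, applying the column factors, rescaling the row, and accumulating the column sums for the next iteration in the same pass.

```python
import numpy as np

def uot_baseline(A, a, b, fi, iters):
    """Hypothetical reference: separate full-matrix passes per iteration."""
    A = A.copy()
    for _ in range(iters):
        col_sum = A.sum(axis=0)                 # pass over A: column sums
        A *= (b / col_sum) ** fi                # pass over A: rescale columns
        row_sum = A.sum(axis=1)                 # pass over A: row sums
        A *= ((a / row_sum) ** fi)[:, None]     # pass over A: rescale rows
    return A

def uot_fused(A, a, b, fi, iters):
    """Hypothetical fused variant: one sweep over A per iteration."""
    A = A.copy()
    col_sum = A.sum(axis=0)                     # one-time setup pass
    for _ in range(iters):
        col_factor = (b / col_sum) ** fi
        next_col_sum = np.zeros_like(col_sum)
        for i in range(A.shape[0]):
            row = A[i]                          # view into A, updated in place
            row *= col_factor                   # column rescaling for this row
            row *= (a[i] / row.sum()) ** fi     # row rescaling while the row is hot
            next_col_sum += row                 # column sums for the next iteration
        col_sum = next_col_sum
    return A

rng = np.random.default_rng(0)
A0 = rng.random((6, 5)) + 0.1                   # strictly positive toy matrix
a0 = rng.random(6) + 0.5                        # toy row marginals
b0 = rng.random(5) + 0.5                        # toy column marginals
same = np.allclose(uot_baseline(A0, a0, b0, 0.5, 10),
                   uot_fused(A0, a0, b0, 0.5, 10))
print("fused matches baseline:", same)
```

In pure Python the explicit row loop is slower than the vectorized baseline; the sketch only shows that the two orderings compute the same result, so the fused ordering can be used in a compiled CPU or GPU kernel where memory traffic dominates.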

Abstract

Unbalanced optimal transport (UOT) has been widely used as a fundamental tool in many application domains, where it often dominates the application running time. While many researchers have proposed various optimizations for UOT, few have attempted to optimize it from a computer architecture perspective. In this paper, we first study the performance bottlenecks of UOT through a series of experiments, which reveal that UOT is heavily memory-bound. Guided by these findings, we propose MAP-UOT, a Memory-efficient APproach to the implementation and optimization of UOT on CPU and GPU platforms. Our experimental evaluations show that the proposed strategy consistently and significantly outperforms the state-of-the-art (SOTA) implementations. Specifically, it improves single-threaded performance over POT/COFFEE by up to 2.9X/2.4X, with an average of 1.9X/1.6X. It also improves parallelized performance over POT/COFFEE by up to 2.4X/1.9X, with an average of 2.2X/1.8X, on an Intel Core i9-12900K, and over POT by up to 3.5X, with an average of 1.6X, on an Nvidia GeForce RTX 3090 Ti. MAP-UOT also shows substantial performance improvement on the Tianhe-1 supercomputer.
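To see why a rescaling pass tends to be memory-bound, a rough Roofline estimate helps. The sketch below assumes one multiply per matrix element with one read and one write of an 8-byte double; the peak compute and bandwidth figures are hypothetical placeholders, not measurements of the i9-12900K or the RTX 3090 Ti, and the function names are illustrative.

```python
def rescaling_pass_intensity(bytes_per_element=8):
    """Arithmetic intensity (flop/byte) of an in-place element-wise rescale.

    Assumes 1 multiply per element, with each element read once and
    written once from main memory.
    """
    flops = 1
    bytes_moved = 2 * bytes_per_element
    return flops / bytes_moved

def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)

# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 500.0      # placeholder compute roof
BANDWIDTH_GBS = 50.0     # placeholder memory bandwidth

ai = rescaling_pass_intensity()
bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
ridge = PEAK_GFLOPS / BANDWIDTH_GBS   # intensity at which the compute roof is reached

print(f"arithmetic intensity: {ai} flop/byte")
print(f"attainable: {bound} GFLOP/s (ridge point at {ridge} flop/byte)")
```

Under these assumptions the intensity sits far below the ridge point, so the bound is set by bandwidth rather than compute, and reducing the number of passes over the matrix is the lever that matters.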

Paper Structure

This paper contains 30 sections, 1 equation, 17 figures, 1 table, 3 algorithms.

Figures (17)

  • Figure 1: Demo implementations of the UOT algorithm in C and Python.
  • Figure 2: Share of running time taken by the UOT algorithm in four applications (top; from sun2023coffee) and in the domain adaptation application (bottom).
  • Figure 3: Global memory Roofline model of UOT on 12900K and RTX 3090 Ti, respectively.
  • Figure 4: L1 and L2 cache miss rates of UOT on 12900K with the NumPy implementation.
  • Figure 5: Global load/store throughput of UOT on RTX 3090 Ti with the CuPy implementation.
  • ...and 12 more figures