MAP-UOT: A Memory-Efficient Approach to Unbalanced Optimal Transport Implementation
Chengyu Sun, Jinyu Hu, Hong Jiang
TL;DR
MAP-UOT identifies UOT as memory-bound using Roofline analysis and eliminates memory bottlenecks by interleaving row and column rescalings, achieving near-peak performance on CPUs, GPUs, and HPC systems. The approach redefines the iteration to a single double-loop per cycle, reducing memory traffic and enhancing cache locality, with tailored CPU and GPU implementations. Empirical results show substantial speedups over POT and COFFEE across platforms (up to 3.5X on GPUs and 7.2X scalability on CPUs), along with reduced memory footprint and improved throughput. The work demonstrates broad applicability to similar matrix-iteration problems and lays the groundwork for future extensions to sparse matrices and architecture-specific optimizations.
Abstract
Unbalanced optimal transport (UOT) has been widely used as a fundamental tool in many application domains, where it often dominates the application running time. While many researchers have proposed various optimizations for UOT, few have attempted to optimize it from a computer architecture's perspective. In this paper, we first study the performance bottlenecks of UOT through a series of experiments, which reveals that UOT is heavily memory-bound. Guided by these findings, we propose MAP-UOT, a Memory-efficient APproach to the implementation and optimization of UOT on CPU and GPU platforms. Our experimental evaluations show that the proposed strategy consistently and significantly outperforms the state-of-the-art (SOTA) implementations. Specifically, it provides single-threaded performance improvement over POT/COFFEE by up to 2.9X/2.4X, with an average of 1.9X/1.6X. At the same time, it provides parallelized performance improvement over POT/COFFEE by up to 2.4X/1.9X, with an average of 2.2X/1.8X, on Intel Core i9-12900K; and over POT by up to 3.5X, with an average of 1.6X, on Nvidia GeForce RTX 3090 Ti. MAP-UOT also shows great performance improvement on the Tianhe-1 supercomputer.
