A Truncated Newton Method for Optimal Transport
Mete Kemertas, Amir-massoud Farahmand, Allan D. Jepson
TL;DR
The paper addresses scalable, high-precision discrete OT with entropic regularization by developing a GPU-friendly truncated Newton method for the EOT dual. It introduces a discounted Hessian formulation and a hybrid Newton-Sinkhorn projection that achieves superlinear local convergence without requiring a Lipschitz Hessian, aided by adaptive temperature annealing within the MDOT framework. Key contributions include a specialized linear conjugate gradient routine for the dual, a TruncatedNewtonProject that combines Newton solves with Sinkhorn projections, and an adaptive initialization strategy that eliminates hyperparameter tuning. Empirical results on large-scale, high-dimensional problems demonstrate orders-of-magnitude speedups in wall-clock time at high precision, with a memory-efficient variant enabling problems up to $n \approx 10^6$. The work offers a practical route to high-accuracy OT on modern GPUs and provides a foundation for future global-convergence and stochastic-memory-efficient extensions.
Abstract
Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic-regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 24 problem sets (12 datasets $\times$ 2 cost functions). The scalability of the algorithm is showcased on an extremely large OT problem with $n \approx 10^6$, solved approximately under weak entropic regularization.
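To make the idea of a truncated Newton method on the entropic OT dual concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the paper's algorithm: it omits the discounted Hessian, the Sinkhorn projection steps, and the MDOT temperature annealing. It is a textbook Newton-CG iteration with Armijo backtracking applied to the convex function $f(u, v) = \sum_{ij} P_{ij} - \langle u, r \rangle - \langle v, c \rangle$ with $P_{ij} = \exp(u_i + v_j - \gamma C_{ij})$. The function names, the forcing term `eta`, and the line-search constants are illustrative choices, not taken from the source.

```python
import numpy as np


def dual_value_and_plan(u, v, C, r, c, gamma):
    """Convex (negated) EOT dual f(u, v) and the plan P(u, v) it induces."""
    P = np.exp(u[:, None] + v[None, :] - gamma * C)
    return P.sum() - u @ r - v @ c, P


def truncated_cg(matvec, b, rel_tol, max_iter):
    """Approximately solve A x = b for PSD A, stopping early (truncation)."""
    x = np.zeros_like(b)
    res = b.copy()
    p = res.copy()
    rs = res @ res
    stop = (rel_tol * np.sqrt(rs)) ** 2
    for _ in range(max_iter):
        if rs <= stop:
            break
        Ap = matvec(p)
        pAp = p @ Ap
        if pAp <= 0.0:  # no usable curvature along p
            break
        alpha = rs / pAp
        x += alpha * p
        res -= alpha * Ap
        rs_new = res @ res
        p = res + (rs_new / rs) * p
        rs = rs_new
    return x


def newton_cg_eot(C, r, c, gamma, tol=1e-9, max_outer=100):
    """Illustrative Newton-CG solver for the EOT dual (hypothetical helper)."""
    n, m = C.shape
    u, v = np.zeros(n), np.zeros(m)
    f, P = dual_value_and_plan(u, v, C, r, c, gamma)
    for _ in range(max_outer):
        row, col = P.sum(axis=1), P.sum(axis=0)
        g = np.concatenate([row - r, col - c])  # gradient = marginal violation
        g_norm = np.abs(g).sum()
        if g_norm <= tol:
            break

        def hess_vec(d, P=P, row=row, col=col):
            # Hessian [[diag(P1), P], [P^T, diag(P^T 1)]] applied matrix-free.
            du, dv = d[:n], d[n:]
            return np.concatenate([row * du + P @ dv, P.T @ du + col * dv])

        # Looser CG tolerance far from the solution, tighter close to it.
        eta = min(0.5, np.sqrt(g_norm))
        d = truncated_cg(hess_vec, -g, eta, 2 * (n + m))
        if not d.any():
            d = -g  # fall back to steepest descent
        slope = g @ d
        t = 1.0
        for _ in range(40):  # Armijo backtracking with a small rounding slack
            f_new, P_new = dual_value_and_plan(
                u + t * d[:n], v + t * d[n:], C, r, c, gamma)
            if f_new <= f + 1e-4 * t * slope + 1e-14 * abs(f):
                break
            t *= 0.5
        u, v, f, P = u + t * d[:n], v + t * d[n:], f_new, P_new
    return u, v, P
```

Calling `newton_cg_eot(C, r, c, gamma)` with a cost matrix `C` and marginals `r`, `c` that each sum to one returns the dual variables and a plan whose row and column sums approximate `r` and `c`. The Hessian is only ever applied through matrix-vector products, which is what makes this style of method amenable to GPU execution. The sketch exponentiates directly and would need log-domain stabilization for weak regularization (large `gamma`).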
