Efficient Batched CPU/GPU Implementation of Orthogonal Matching Pursuit for Python
Ariel Lubonja, Sebastian Kazmarek Præsius, Trac Duy Tran
TL;DR
This work addresses sparse signal recovery via Orthogonal Matching Pursuit (OMP) and proposes a batched CPU/GPU implementation that uses Cholesky-based updates to accelerate computation. Two algorithmic variants are presented: a naive approach based on a dense Gramian and an inverse-Cholesky (v0) variant. Both are designed for batched operation, with careful attention to memory layout and to BLAS/LAPACK-accelerated kernels. The authors report substantial speedups over Scikit-Learn, up to roughly 200x on GPUs for large-scale problems, and provide detailed engineering improvements, benchmarks on the Yale dataset, and guidance for future optimization. The practical impact is fast, scalable OMP-based sparse recovery in Python on both CPU and GPU hardware, which broadens its applicability to real-time and batch-processing contexts.
Abstract
Finding the sparsest solution to the underdetermined system $\mathbf{y}=\mathbf{Ax}$, given a tolerance, is known to be NP-hard. A popular way to approximate a sparse solution is to use greedy pursuit algorithms, of which Orthogonal Matching Pursuit (OMP) is one of the most widely used. In this paper, we present an efficient implementation of OMP that leverages properties of the Cholesky inverse as well as the power of Graphics Processing Units (GPUs) to deliver up to a 200x speedup over the OMP implementation found in Scikit-Learn.
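For readers unfamiliar with the greedy procedure, the following is a minimal single-signal sketch of textbook OMP in NumPy. It is not the batched Cholesky-based implementation described in this paper: it re-solves a least-squares problem from scratch at every iteration, which is the cost that Cholesky-style updates are meant to avoid. The function name `omp` and the parameters `n_nonzero` and `tol` are hypothetical and chosen only for illustration.

```python
import numpy as np


def omp(A, y, n_nonzero, tol=1e-10):
    """Textbook OMP sketch: greedily pick one column of A per iteration.

    Illustrative assumption only; uses a plain least-squares solve
    instead of the Cholesky-based updates discussed in the paper.
    """
    n = A.shape[1]
    residual = np.asarray(y, dtype=float).copy()
    support = []          # indices of the selected columns
    coef = np.zeros(0)    # coefficients on the selected columns
    for _ in range(n_nonzero):
        # Select the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect a column
        support.append(int(np.argmax(correlations)))
        # Re-fit y on all selected columns (orthogonal projection step).
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) <= tol:
            break
    x = np.zeros(n)
    x[support] = coef
    return x


# Usage: a small random underdetermined problem with unit-norm columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 0.7]
x_hat = omp(A, A @ x_true, n_nonzero=3)
print(np.flatnonzero(x_hat))  # indices of the selected columns
```

The loop stops either after `n_nonzero` selections or once the residual norm falls below `tol`, mirroring the "given a tolerance" formulation above. Processing many signals at once would replace the matrix-vector products with matrix-matrix products, which is where batching and GPU execution become attractive.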
