Efficient Batched CPU/GPU Implementation of Orthogonal Matching Pursuit for Python

Ariel Lubonja; Sebastian Kazmarek Præsius; Trac Duy Tran

TL;DR

This work addresses sparse signal recovery via Orthogonal Matching Pursuit (OMP) and proposes a batched CPU/GPU implementation that uses Cholesky-based updates to accelerate computation. Two algorithmic variants are presented: a "Naïve" dense Gramian approach and an inverse-Cholesky (v0) variant. Both are designed for batched operation, with careful attention to memory layout and BLAS/LAPACK-accelerated kernels. The authors report substantial speedups over Scikit-Learn, up to roughly 200x on GPUs for large-scale problems, and provide detailed engineering improvements, benchmarks on the Yale dataset, and guidance for future optimization. The practical impact is fast, scalable OMP-based sparse recovery in Python on CPU and GPU hardware, which broadens applicability to real-time and batch-processing contexts.

Abstract

Finding the sparsest solution to the underdetermined system $\mathbf{y}=\mathbf{Ax}$, given a tolerance, is known to be NP-hard. A popular way to approximate a sparse solution is to use greedy pursuit algorithms, of which Orthogonal Matching Pursuit (OMP) is one of the most widely used. In this paper, we present an efficient implementation of OMP that leverages properties of the Cholesky inverse as well as the power of Graphics Processing Units (GPUs) to deliver up to 200x speedup over the OMP implementation found in Scikit-Learn.
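For readers unfamiliar with the greedy procedure, the following is a minimal single-signal sketch of textbook OMP in NumPy. It is not the implementation described in this paper: it re-solves a dense least-squares problem at every step instead of using Cholesky-based updates, it is not batched, and it runs on the CPU only. The function name `omp` and the parameters `n_nonzero` and `tol` are illustrative assumptions.

```python
import numpy as np


def omp(A, y, n_nonzero, tol=1e-10):
    """Textbook OMP sketch: greedily build a sparse x with y ~= A @ x.

    Illustrative only; uses a plain least-squares refit per iteration
    rather than the Cholesky-based updates discussed in the paper.
    """
    n_atoms = A.shape[1]
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []            # indices of the atoms selected so far
    coef = np.zeros(0)      # coefficients on the current support

    for _ in range(n_nonzero):
        # Pick the atom (column of A) most correlated with the residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect an atom
        support.append(int(np.argmax(correlations)))

        # Orthogonal step: refit y on all selected atoms at once.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef

        # Stop early once the residual meets the tolerance.
        if np.linalg.norm(residual) <= tol:
            break

    x = np.zeros(n_atoms)
    x[support] = coef
    return x
```

Calling `omp(A, y, n_nonzero=k)` returns a vector with at most `k` nonzero entries. The per-iteration least-squares refit is the step whose cost grows with the support size, which is why an incremental factorization update is attractive there.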

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 2 tables, and 1 algorithm.

Introduction
Implementations of OMP
Our "Naïve" Algorithm
Algorithm v0
Implementation details and Engineering Tricks
Memory layout
Matrix batched-matrix products
Packed representation
Efficient batched argmax
Batched stopping criteria
Other possible optimizations
Benchmarks
Yale Face Classification (HW7)
Conclusion
Appendix
...and 3 more sections

Figures (3)

Figure 1: Relative time for running OMP.
Figure 2: Bar plot emphasizing the order-of-magnitude difference in performance for Homework 7.
Figure 3: Example $96 \times 84$ image used for benchmarking (HW7).
