Efficient Batched CPU/GPU Implementation of Orthogonal Matching Pursuit for Python

Ariel Lubonja, Sebastian Kazmarek Præsius, Trac Duy Tran

TL;DR

This work addresses sparse signal recovery via Orthogonal Matching Pursuit (OMP) and proposes a batched CPU/GPU implementation that leverages Cholesky-based updates to dramatically accelerate computation. Two algorithmic variants are presented: a Naïve dense Gramian approach and an inverse-Cholesky (v0) variant, both designed for batched operation and with careful attention to memory layout and BLAS/LAPACK-accelerated kernels. The authors report substantial speedups over Scikit-Learn, including up to roughly 200x faster performance on GPUs for large-scale problems, and provide detailed engineering improvements, benchmarks on the Yale dataset, and guidance for future optimization. The practical impact is enabling fast, scalable OMP-based sparse recovery in Python on CPU and GPU hardware, broadening applicability to real-time or batch-processing contexts.

Abstract

Finding the sparsest solution to the underdetermined system $\mathbf{y}=\mathbf{Ax}$, given a tolerance, is known to be NP-hard. A popular way to approximate a sparse solution is to use greedy pursuit algorithms, of which Orthogonal Matching Pursuit (OMP) is one of the most widely used. In this paper, we present an efficient implementation of OMP that leverages properties of the Cholesky inverse as well as the power of Graphics Processing Units (GPUs) to deliver up to 200x speedup over the OMP implementation found in Scikit-Learn.
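For readers unfamiliar with the greedy procedure the abstract refers to, the following is a minimal single-signal sketch of textbook OMP in NumPy. It is not the batched Cholesky-based implementation described in the paper: it refits the selected coefficients with a plain least-squares solve at every step, and the function name `omp_reference` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def omp_reference(A, y, n_nonzero, tol=1e-10):
    """Textbook OMP sketch (illustrative only, not the paper's implementation).

    Each iteration selects the column of A most correlated with the current
    residual, then refits all selected coefficients by least squares.
    """
    n_atoms = A.shape[1]
    support = []                      # indices of selected columns, in pick order
    coef = np.zeros(0)
    residual = np.asarray(y, dtype=float).copy()

    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break                     # signal already explained within tolerance
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0   # never reselect a chosen column
        support.append(int(np.argmax(correlations)))
        # Orthogonal projection of y onto the span of the selected columns.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef

    x = np.zeros(n_atoms)
    if support:
        x[support] = coef
    return x


# Tiny underdetermined example: 3 equations, 4 unknowns, 2-sparse ground truth.
A = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.5]])
x_true = np.array([2.0, 0.0, -3.0, 0.0])
y = A @ x_true
x_hat = omp_reference(A, y, n_nonzero=2)
```

The repeated least-squares solve is the step that a Cholesky-based update is meant to avoid recomputing from scratch, which is why this sketch is useful as a correctness reference rather than as a fast implementation.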

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Relative time for running OMP.
  • Figure 2: Bar plot emphasizing the order-of-magnitude difference in performance for Homework 7.
  • Figure 3: Example $96 \times 84$ image used for benchmarking (HW7).