On The Performance of Prefix-Sum Parallel Kalman Filters and Smoothers on GPUs

Simo Särkkä, Ángel F. García-Fernández

TL;DR

This work empirically evaluates temporally parallel Kalman filtering and smoothing on GPUs using all-prefix-sum (scan) algorithms, comparing multiple prefix-sum strategies and introducing a novel parallel two-filter smoother (PTFS). It demonstrates that prefix-sum choice strongly influences practical performance, with Blelloch- and Ladner-Fischer-based scans yielding favorable speedups, and that a two-GPU PTFS can outperform single-GPU counterparts. The authors provide Julia-based Metal and CUDA implementations and show GPU speedups up to about 750× (Metal) and 500× (CUDA) relative to sequential Kalman filtering. The results validate that GPU-accelerated, temporally parallel Bayesian estimation can significantly accelerate real-time state estimation tasks while remaining close to theoretical operation counts.

Abstract

This paper presents an experimental evaluation of parallel-in-time Kalman filters and smoothers using graphics processing units (GPUs). In particular, the paper evaluates different all-prefix-sum algorithms, that is, parallel scan algorithms for temporal parallelization of Kalman filters and smoothers in two ways: by calculating the required number of operations via simulation, and by measuring the actual run time of the algorithms on real GPU hardware. In addition, a novel parallel-in-time two-filter smoother is proposed and experimentally evaluated. Julia code for Metal and CUDA implementations of all the algorithms is made publicly available.

Paper Structure

This paper contains 35 sections, 2 theorems, 39 equations, 15 figures, 7 algorithms.

Key Result

Lemma 3

Given two elements $\left(A_{i},b_{i},C_{i},\eta_{i},J_{i}\right)$ and $\left(A_{j},b_{j},C_{j},\eta_{j},J_{j}\right)$ the binary operator $\otimes$ for filtering returns an element $\left(A_{i,j},b_{i,j},C_{i,j},\eta_{i,j},J_{i,j}\right)$ where

Figures (15)

  • Figure 1: Illustration of the operation of the algorithm of Hillis and Steele (1986) with 16 elements. Each column is a data storage element, with $\otimes$ being the associative operator between two elements. The arrows indicate increasing time. At the final time step, the data elements contain the prefix sums.
  • Figure 2: Illustration of the operation of Blelloch's algorithm (Blelloch, 1990) with 16 elements.
  • Figure 3: Illustration of the operation of the in-place algorithm of Ladner and Fischer (1980) with 16 elements.
  • Figure 4: Sketch of Julia implementation of generic up-sweep GPU kernel of a parallel associative scan.
  • Figure 5: Functions to compute the index and stride on Metal and CUDA.
  • ...and 10 more figures
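
The scans described in Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal CPU simulation under stated assumptions, not the authors' Julia GPU code: the function names `hillis_steele_scan` and `blelloch_scan` are hypothetical, each parallel step is emulated by an ordinary loop, and the Blelloch variant assumes a power-of-two length and an identity element for $\otimes$. String concatenation stands in for $\otimes$ because it is associative but not commutative, so any mistake in operand order shows up in the output. Both functions also count operator applications and parallel steps, in the spirit of the operation counting by simulation mentioned in the abstract.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def hillis_steele_scan(
    elems: Sequence[T], op: Callable[[T, T], T]
) -> Tuple[List[T], int, int]:
    """Inclusive scan; returns (prefix sums, operator calls, parallel steps)."""
    a = list(elems)
    n = len(a)
    ops = steps = 0
    d = 1
    while d < n:
        prev = a[:]  # snapshot: every update in a step reads the old values
        for i in range(d, n):  # these updates are independent of each other
            a[i] = op(prev[i - d], prev[i])
            ops += 1
        d *= 2
        steps += 1
    return a, ops, steps


def blelloch_scan(
    elems: Sequence[T], op: Callable[[T, T], T], identity: T
) -> Tuple[List[T], int, int]:
    """Exclusive scan (up-sweep then down-sweep) for power-of-two lengths."""
    a = list(elems)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    ops = steps = 0
    d = 1
    while d < n:  # up-sweep: build partial sums towards the last element
        for right in range(2 * d - 1, n, 2 * d):
            a[right] = op(a[right - d], a[right])
            ops += 1
        d *= 2
        steps += 1
    a[n - 1] = identity
    d = n // 2
    while d >= 1:  # down-sweep: push prefixes back down the tree
        for right in range(2 * d - 1, n, 2 * d):
            left_sum = a[right - d]
            a[right - d] = a[right]            # left child gets the parent prefix
            a[right] = op(a[right], left_sum)  # right child: prefix, then left sum
            ops += 1
        d //= 2
        steps += 1
    return a, ops, steps


if __name__ == "__main__":
    letters = list("abcdefghijklmnop")  # 16 elements, as in the figures
    concat = lambda x, y: x + y
    print(hillis_steele_scan(letters, concat)[1:])  # (49, 4)
    print(blelloch_scan(letters, concat, "")[1:])   # (30, 8)
```

With 16 elements, this simulation of the Hillis and Steele scan finishes in 4 parallel steps but applies the operator 49 times, while the Blelloch scan needs 8 steps and 30 operator applications. That trade-off between step count and total work is one reason the choice of prefix-sum algorithm can matter in practice.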

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Lemma 3
  • Lemma 4