On The Performance of Prefix-Sum Parallel Kalman Filters and Smoothers on GPUs
Simo Särkkä, Ángel F. García-Fernández
TL;DR
This work empirically evaluates temporally parallel Kalman filtering and smoothing on GPUs using all-prefix-sum (scan) algorithms, comparing multiple prefix-sum strategies and introducing a novel parallel two-filter smoother (PTFS). It demonstrates that prefix-sum choice strongly influences practical performance, with Blelloch- and Ladner-Fischer-based scans yielding favorable speedups, and that a two-GPU PTFS can outperform single-GPU counterparts. The authors provide Julia-based Metal and CUDA implementations and show GPU speedups up to about 750× (Metal) and 500× (CUDA) relative to sequential Kalman filtering. The results validate that GPU-accelerated, temporally parallel Bayesian estimation can significantly accelerate real-time state estimation tasks while remaining close to theoretical operation counts.
Abstract
This paper presents an experimental evaluation of parallel-in-time Kalman filters and smoothers on graphics processing units (GPUs). In particular, the paper evaluates different all-prefix-sum algorithms, that is, parallel scan algorithms, for the temporal parallelization of Kalman filters and smoothers. The evaluation is done in two ways: by calculating the required number of operations via simulation, and by measuring the actual run time of the algorithms on real GPU hardware. In addition, a novel parallel-in-time two-filter smoother is proposed and experimentally evaluated. Julia code for Metal and CUDA implementations of all the algorithms is made publicly available.
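To make the idea of an all-prefix-sum (scan) concrete, the following is a minimal Python sketch of a Blelloch-style scan over a generic associative operator, with a counter for operator applications in the spirit of the operation-count simulation mentioned above. It is not the authors' Julia implementation: the function name `blelloch_scan`, its signature, and the use of string concatenation as a stand-in operator are assumptions made for illustration only. The loops run sequentially here; on a GPU, the iterations of each inner loop are independent and could run in parallel. In the Kalman filtering setting, the elements and the operator would be replaced by the filter's own associative elements and combination rule, which are not shown.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def blelloch_scan(
    items: Sequence[T], op: Callable[[T, T], T], identity: T
) -> Tuple[List[T], int]:
    """Inclusive scan of `items` under an associative `op` (hypothetical helper).

    Returns the list of prefixes and the number of times `op` was applied.
    `op` must be associative but need not be commutative.
    """
    n = len(items)
    if n == 0:
        return [], 0

    # Pad to the next power of two with the identity element.
    size = 1
    while size < n:
        size *= 2
    x = list(items) + [identity] * (size - n)
    count = 0

    # Up-sweep (reduce): build partial results up a binary tree.
    d = 1
    while d < size:
        for k in range(2 * d - 1, size, 2 * d):  # independent iterations
            x[k] = op(x[k - d], x[k])
            count += 1
        d *= 2

    # Down-sweep: push prefixes back down, giving an exclusive scan.
    x[size - 1] = identity
    d = size // 2
    while d >= 1:
        for k in range(2 * d - 1, size, 2 * d):  # independent iterations
            left_sum = x[k - d]
            x[k - d] = x[k]
            # Argument order matters for non-commutative operators:
            # the incoming prefix goes on the left, the left subtree on the right.
            x[k] = op(x[k], left_sum)
            count += 1
        d //= 2

    # Convert the exclusive scan to an inclusive one.
    result = [op(x[i], items[i]) for i in range(n)]
    count += n
    return result, count


if __name__ == "__main__":
    # String concatenation is associative but not commutative.
    prefixes, n_ops = blelloch_scan(list("abcd"), lambda a, b: a + b, "")
    print(prefixes, n_ops)
```

The sketch applies the operator more often than a plain sequential loop would, but its two sweeps have only logarithmic depth in the sequence length, which is the trade-off that makes scans attractive on parallel hardware.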
