Parallel-in-Time Kalman Smoothing Using Orthogonal Transformations

Shahaf Gargir, Sivan Toledo

TL;DR

The paper tackles the sequential bottleneck of Kalman smoothing by introducing a numerically stable parallel-in-time smoother based on a specialized sparse QR factorization with an odd-even block permutation. Covariance information is recovered efficiently through a SelInv-based adaptation, enabling diagonal blocks of $(R^T R)^{-1}$ to be computed in parallel. Implemented in C/C++ with Threading Building Blocks, the Odd-Even smoother scales well on multi-core servers, delivering up to 47x speedups on 64 cores, and generally outperforms the prior parallel-in-time approach by Särkkä and García-Fernández while maintaining numerical stability and flexibility (e.g., handling non-identity $H_i$ and unknown initial-state expectations). The work highlights the trade-offs between parallelism and arithmetic overhead, demonstrates practical performance, and provides open-source access to the implementation for further adoption in high-dimensional Kalman smoothing tasks.
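To make the QR-based formulation concrete, the following is a minimal dense sketch, not the paper's algorithm: it stacks the evolution and observation equations into one least-squares system, solves it with an ordinary (sequential, dense) QR factorization, and reads the state covariances off the diagonal blocks of $(R^T R)^{-1}$. The notation (`F`, `H`, `G`, `o`, `c`), the unit noise covariances, and the function names are assumptions made for illustration; the odd-even permutation and the SelInv adaptation are deliberately replaced by NumPy's dense `qr` and `inv`.

```python
import numpy as np

def build_least_squares(F, H, G, o, c):
    """Stack evolution and observation equations into one system A u ~ b.

    Assumed model (illustrative notation, unit noise covariances):
      H[i] u_i = F[i] u_{i-1} + c[i] + noise   for i >= 1
      o[i]     = G[i] u_i + noise              for i >= 0
    """
    k = len(G)            # number of states
    n = G[0].shape[1]     # dimension of each state
    rows, rhs = [], []
    for i in range(k):
        if i > 0:
            # Evolution equation couples block columns i-1 and i.
            block = np.zeros((H[i].shape[0], k * n))
            block[:, (i - 1) * n:i * n] = -F[i]
            block[:, i * n:(i + 1) * n] = H[i]
            rows.append(block)
            rhs.append(c[i])
        # Observation equation touches block column i only.
        block = np.zeros((G[i].shape[0], k * n))
        block[:, i * n:(i + 1) * n] = G[i]
        rows.append(block)
        rhs.append(o[i])
    return np.vstack(rows), np.concatenate(rhs)

def smooth_dense_qr(F, H, G, o, c):
    """Smoothed states and their covariances via a dense QR factorization."""
    A, b = build_least_squares(F, H, G, o, c)
    k, n = len(G), G[0].shape[1]
    Q, R = np.linalg.qr(A)             # reduced QR: R is (k*n)-by-(k*n)
    u = np.linalg.solve(R, Q.T @ b)    # least-squares solution, stacked states
    S = np.linalg.inv(R.T @ R)         # dense inverse; selective inversion avoids forming all of it
    covs = [S[i * n:(i + 1) * n, i * n:(i + 1) * n] for i in range(k)]
    return u.reshape(k, n), covs

# Tiny random-walk example: identity dynamics, every state observed directly.
rng = np.random.default_rng(0)
k, n = 5, 2
F = [None] + [np.eye(n) for _ in range(k - 1)]   # index 0 is unused
H = [None] + [np.eye(n) for _ in range(k - 1)]
c = [None] + [np.zeros(n) for _ in range(k - 1)]
G = [np.eye(n) for _ in range(k)]
o = [rng.standard_normal(n) for _ in range(k)]
states, covs = smooth_dense_qr(F, H, G, o, c)
```

The dense version costs cubic time in the total number of unknowns, which is why a structured factorization matters in practice; the sketch is only meant to show which quantities the factor $R$ provides.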

Abstract

We present a numerically-stable parallel-in-time linear Kalman smoother. The smoother uses a novel highly-parallel QR factorization for a class of structured sparse matrices for state estimation, and an adaptation of the SelInv selective-inversion algorithm to evaluate the covariance matrices of estimated states. Our implementation of the new algorithm, using the Threading Building Blocks (TBB) library, scales well on both Intel and ARM multi-core servers, achieving speedups of up to 47x on 64 cores. The algorithm performs more arithmetic than sequential smoothers; consequently it is 1.8x to 2.5x slower on a single core. The new algorithm is faster and scales better than the parallel Kalman smoother proposed by Särkkä and García-Fernández in 2021.
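The reported numbers can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that the 47x speedup is measured against the same implementation on one core, and that the 1.8x to 2.5x single-core overhead applies to the same configuration; neither pairing is stated in the abstract, so the derived ranges are illustrative only.

```python
cores = 64
speedup_vs_one_core = 47.0               # reported peak speedup
overhead_low, overhead_high = 1.8, 2.5   # reported single-core slowdown vs. sequential smoothers

# Fraction of ideal linear scaling achieved at 64 cores.
efficiency = speedup_vs_one_core / cores

# Implied speedup over a sequential smoother, under the assumptions above.
net_low = speedup_vs_one_core / overhead_high
net_high = speedup_vs_one_core / overhead_low

print(f"parallel efficiency: {efficiency:.2f}")                           # 0.73
print(f"implied speedup vs. sequential: {net_low:.1f}x to {net_high:.1f}x")  # 18.8x to 26.1x
```

Under these assumptions, the extra arithmetic costs roughly half of the raw parallel gain, which is the trade-off between parallelism and arithmetic overhead that the abstract points to.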

Paper Structure

This paper contains 16 sections, 29 equations, 6 figures, 2 algorithms.

Figures (6)

  • Figure 1: The structure of $R$ in the odd-even algorithm. The problem consisted of $k=50$ states. Each gray square represents an $n$-by-$n$ nonzero block.
  • Figure 2: Running times of all the smoothers on a server with 64 physical cores (Graviton3) and on a server with 56 physical cores (2 Intel Xeon Gold 6238R CPUs).
  • Figure 3: Speedups of the parallel smoothers. The ratios are relative to the running time of the same implementation on 1 core. The graphs are based on the same data shown in Figure 2.
  • Figure 4: Speedups of 4 phases of a representative but embarrassingly-parallel micro-benchmark.
  • Figure 5: Running-time distributions of the Odd-Even algorithm on 1 core and on 28 cores. Each histogram covers 100 runs. The horizontal spans are set to 20% of the median running time.
  • ...and 1 more figure