Vec-QMDP: Vectorized POMDP Planning on CPUs for Real-Time Autonomous Driving

Xuanjin Jin, Yanxin Dong, Bin Sun, Huan Xu, Zhihui Hao, XianPeng Lang, Panpan Cai

TL;DR

Vec-QMDP introduces a CPU-native, SIMD-optimized POMDP planner for real-time autonomous driving by leveraging the QMDP approximation to decompose belief trees into independent sub-trees and applying data-oriented design to enable global and local vectorization. It couples multi-threading across CPU cores with wide SIMD kernels, and uses a load-balancing UCB to align expansion depth across scenario trees, achieving substantial speedups over serial planners. The approach also includes vectorized belief-space trajectory optimization with cross-scenario evaluation, supported by efficient collision checking via STR-trees and two-stage SIMD kernels. Evaluations on the nuPlan benchmark demonstrate millisecond-level planning and up to 1073× throughput gains in dense traffic, establishing CPUs as a high-performance platform for large-scale planning under uncertainty in autonomous driving.
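The decomposition into independent sub-trees follows from the standard QMDP approximation, $Q(b,a) \approx \sum_s b(s)\,Q_{\text{MDP}}(s,a)$, which treats the scenario as fully known after the first action. The sketch below illustrates only that final combination step, under stated assumptions: the function name `qmdp_action`, the two-scenario belief, and the Q-values are all hypothetical, and each row of `q_values` stands in for the result of one per-scenario search that could run without reference to the others.

```python
def qmdp_action(belief, q_values):
    """Pick the action maximizing the belief-weighted per-scenario Q-value.

    belief:   scenario weights that sum to 1 (hypothetical input).
    q_values: q_values[s][a] is the value of action a if scenario s were
              known for certain; each row comes from an independent search.
    """
    num_actions = len(q_values[0])
    expected = [
        sum(w * q_s[a] for w, q_s in zip(belief, q_values))
        for a in range(num_actions)
    ]
    best = max(range(num_actions), key=lambda a: expected[a])
    return best, expected


# Illustrative numbers only: two sampled scenarios, three candidate actions.
belief = [0.7, 0.3]
q_values = [
    [1.0, 0.0, 0.5],     # scenario 0: the other driver yields
    [-10.0, 0.0, 0.2],   # scenario 1: the other driver does not yield
]
best_action, expected_q = qmdp_action(belief, q_values)
```

Because the rows of `q_values` never interact until the weighted sum, the per-scenario searches can be assigned to separate threads or SIMD lanes; how Vec-QMDP schedules them is described in the paper, not in this sketch.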

Abstract

Planning under uncertainty for real-world robotics tasks, such as autonomous driving, requires reasoning in enormous high-dimensional belief spaces, rendering the problem computationally intensive. While parallelization offers scalability, existing hybrid CPU-GPU solvers face critical bottlenecks due to host-device synchronization latency and branch divergence on SIMT architectures, limiting their utility for real-time planning and hindering real-robot deployment. We present Vec-QMDP, a CPU-native parallel planner that aligns POMDP search with modern CPUs' SIMD architecture, achieving $227\times$--$1073\times$ speedup over state-of-the-art serial planners. Vec-QMDP adopts a Data-Oriented Design (DOD), refactoring scattered, pointer-based data structures into contiguous, cache-efficient memory layouts. We further introduce a hierarchical parallelism scheme: distributing sub-trees across independent CPU cores and SIMD lanes, enabling fully vectorized tree expansion and collision checking. Efficiency is maximized with the help of UCB load balancing across trees and a vectorized STR-tree for coarse-level collision checking. Evaluated on large-scale autonomous driving benchmarks, Vec-QMDP achieves state-of-the-art planning performance with millisecond-level latency, establishing CPUs as a high-performance computing platform for large-scale planning under uncertainty.
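To make the data-oriented layout concrete, here is a minimal structure-of-arrays sketch. It is not the authors' implementation: the class `ScenarioBatch`, its fields, and the constant-acceleration dynamics are assumptions chosen for illustration, and plain Python does not emit SIMD instructions. What it does show is the layout idea: each field of every scenario lives in one contiguous buffer, and the update applies the same arithmetic to every element without per-scenario branching, which is the access pattern a vectorizing compiler or SIMD kernel relies on.

```python
from array import array


class ScenarioBatch:
    """Structure-of-arrays storage: one contiguous buffer per field."""

    def __init__(self, xs, vs):
        self.x = array("d", xs)  # positions of all scenarios, contiguous
        self.v = array("d", vs)  # speeds of all scenarios, contiguous

    def step(self, accel, dt):
        # Identical arithmetic for every scenario; no pointer chasing and
        # no per-scenario branching (hypothetical constant-acceleration model).
        for i in range(len(self.x)):
            self.x[i] += self.v[i] * dt + 0.5 * accel * dt * dt
            self.v[i] += accel * dt


batch = ScenarioBatch(xs=[0.0, 5.0, 10.0], vs=[2.0, 2.0, 4.0])
batch.step(accel=1.0, dt=2.0)
```

An array-of-structures version would store one object per scenario holding both `x` and `v`, so a pass over all positions would stride across unrelated fields instead of reading one dense buffer.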

Paper Structure

This paper contains 27 sections, 4 equations, 7 figures, 1 table, and 1 algorithm.

Figures (7)

  • Figure 1: Real-time belief tree search in complex urban environments. Vec-QMDP achieves millisecond-level planning by parallelizing belief tree search across $10{,}000+$ future scenarios, enabling the ego-vehicle to navigate interactive traffic and respond to high-risk intentions within $14\,\text{ms}$.
  • Figure 2: Overview of Vec-QMDP. (a) Sample the belief into $M\times N$ scenarios in an SoA layout. (b) Vectorized QMDP search: after the first action, scenario trees run in parallel on $M$ CPU threads; within each thread, SIMD global vectorization batches transition dynamics across scenarios and SIMD local vectorization accelerates within-node collision checks. (c) Vectorized trajectory optimization: generate candidates and use block-diagonal cross-scenario evaluation within minibatches to select $\tau^*$.
  • Figure 3: Two-stage SIMD collision checking. (a) Broad phase: SIMD AABB tests in the Frenet frame traverse a pointer-less STR-tree to prune candidates. (b) Narrow phase: SIMD SAT checks evaluate ego--agent pairs to compute collisions.
  • Figure 4: Tree construction throughput. (Left) Edges/ms vs. traffic density. (Right) Speedup over serial Hi-Drive (227$\times$--1073$\times$), increasing with density.
  • Figure 5: Ablation: multi-threading. (Left) Edges/ms vs. traffic density. (Right) Speedup over single-threaded (ST), showing near-linear scaling ($\sim$8$\times$).
  • ...and 2 more figures
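The two-stage collision check described in the Figure 3 caption can be sketched in scalar form. This is a simplified illustration under stated assumptions, not the paper's kernel: it scans agents linearly instead of traversing an STR-tree, works in Cartesian rather than Frenet coordinates, and uses no SIMD. All function names and the rectangle parameterization `(cx, cy, half_len, half_wid, yaw)` are hypothetical. The broad phase rejects pairs whose axis-aligned bounding boxes do not overlap; the narrow phase runs a separating axis test (SAT) on the survivors.

```python
import math


def corners(cx, cy, half_len, half_wid, yaw):
    """Corner points of an oriented rectangle, in perimeter order."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        (cx + sx * half_len * c - sy * half_wid * s,
         cy + sx * half_len * s + sy * half_wid * c)
        for sx, sy in ((1, 1), (1, -1), (-1, -1), (-1, 1))
    ]


def aabb(pts):
    """Axis-aligned bounding box as (min_x, min_y, max_x, max_y)."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)


def aabb_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def sat_overlap(p, q):
    """Separating axis test for two convex polygons given as corner lists."""
    for poly in (p, q):
        for i in range(len(poly)):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % len(poly)]
            ax, ay = y1 - y2, x2 - x1  # normal of this edge
            proj_p = [ax * x + ay * y for x, y in p]
            proj_q = [ax * x + ay * y for x, y in q]
            if max(proj_p) < min(proj_q) or max(proj_q) < min(proj_p):
                return False  # found a separating axis
    return True


def colliding_agents(ego, agents):
    """Indices of agents whose rectangles intersect the ego rectangle."""
    ego_pts = corners(*ego)
    ego_box = aabb(ego_pts)
    hits = []
    for idx, agent in enumerate(agents):
        pts = corners(*agent)
        if not aabb_overlap(ego_box, aabb(pts)):  # broad phase: cheap reject
            continue
        if sat_overlap(ego_pts, pts):             # narrow phase: exact test
            hits.append(idx)
    return hits


ego = (0.0, 0.0, 2.0, 1.0, 0.0)
agents = [
    (10.0, 0.0, 2.0, 1.0, 0.0),          # far away: pruned by the broad phase
    (3.0, 0.0, 2.0, 1.0, 0.0),           # overlaps the ego rectangle
    (3.2, -2.2, 2.0, 0.2, math.pi / 4),  # boxes overlap, rectangles do not
]
hits = colliding_agents(ego, agents)
```

The third agent is the case that motivates the second stage: a thin rectangle rotated by 45 degrees has a large bounding box that clips the ego box, so only the exact test can clear it.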