An Open-Source Modular Benchmark for Diffusion-Based Motion Planning in Closed-Loop Autonomous Driving

Yun Li, Simon Thompson, Yidu Zhang, Ehsan Javanmardi, Manabu Tsukada

TL;DR

This paper presents an open-source modular benchmark that decomposes a monolithic 18,398-node diffusion planner into three independently executable modules and reimplements the DPM-Solver++ denoising loop in native C++, breaking the black box of monolithic deployment.

Abstract

Diffusion-based motion planners have achieved state-of-the-art results on benchmarks such as nuPlan, yet their evaluation within closed-loop production autonomous driving stacks remains largely unexplored. Existing evaluations abstract away ROS 2 communication latency and real-time scheduling constraints, while monolithic ONNX deployment freezes all solver parameters at export time. We present an open-source modular benchmark that addresses both gaps: using ONNX GraphSurgeon, we decompose a monolithic 18,398-node diffusion planner into three independently executable modules and reimplement the DPM-Solver++ denoising loop in native C++. Integrated as a ROS 2 node within Autoware, the open-source AD stack deployed on real vehicles worldwide, the system enables runtime-configurable solver parameters without model recompilation and per-step observability of the denoising process, breaking the black box of monolithic deployment. Unlike evaluations in standalone simulators such as CARLA, our benchmark operates within a production-grade stack and is validated through AWSIM closed-loop simulation. Through systematic comparison of DPM-Solver++ (first- and second-order) and DDIM across six step-count configurations (N in {3, 5, 7, 10, 15, 20}), we show that encoder caching yields a 3.2x latency reduction, and that second-order solving reduces FDE by 41% at N=3 compared to first-order. The complete codebase will be released as open source, providing a direct path from simulation benchmarks to real-vehicle deployment.
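To make the modular idea concrete, the following is a minimal sketch of a solver loop that encodes the scene context once and reuses it at every denoising step. It is not the authors' implementation: the `Encoder` and `Denoiser` callables, the `Schedule` struct, and the `plan` function are hypothetical names introduced here for illustration, and the real modules are ONNX graphs rather than C++ lambdas. The update shown is the standard first-order DPM-Solver++ step in data-prediction form, $x_t = (\sigma_t/\sigma_s)\,x_s - \alpha_t\,(e^{-h} - 1)\,x_\theta(x_s, s)$ with $h = \lambda_t - \lambda_s$ and $\lambda = \log(\alpha/\sigma)$; the second-order variant is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Noise schedule sampled at the N+1 solver time points, noisiest first.
// Because the schedule is plain data, N can change at runtime without
// touching the exported networks.
struct Schedule {
  std::vector<double> alpha;  // signal scale at each time point
  std::vector<double> sigma;  // noise scale at each time point (must be > 0)
};

// Hypothetical module interfaces (assumptions for this sketch).
using Encoder = std::function<Vec(const Vec& scene)>;
using Denoiser =
    std::function<Vec(const Vec& x, std::size_t step, const Vec& context)>;

// Runs N first-order DPM-Solver++ steps on the noisy trajectory x.
Vec plan(const Vec& scene, Vec x, const Schedule& sch,
         const Encoder& encode, const Denoiser& predict_x0) {
  assert(sch.alpha.size() == sch.sigma.size() && sch.alpha.size() >= 2);

  // Encoder caching: the context is computed once, outside the loop.
  const Vec context = encode(scene);

  const std::size_t n_steps = sch.alpha.size() - 1;
  for (std::size_t i = 0; i < n_steps; ++i) {
    const double lambda_s = std::log(sch.alpha[i] / sch.sigma[i]);
    const double lambda_t = std::log(sch.alpha[i + 1] / sch.sigma[i + 1]);
    const double h = lambda_t - lambda_s;

    // One network call per step; x0 is the predicted clean trajectory.
    const Vec x0 = predict_x0(x, i, context);
    assert(x0.size() == x.size());

    const double keep = sch.sigma[i + 1] / sch.sigma[i];
    const double pull = -sch.alpha[i + 1] * std::expm1(-h);
    for (std::size_t k = 0; k < x.size(); ++k) {
      x[k] = keep * x[k] + pull * x0[k];  // x holds the per-step state
    }
  }
  return x;
}
```

Because `x` is an ordinary vector between network calls, each intermediate state can be logged or published, which is the kind of per-step observability the abstract describes.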

Paper Structure (19 sections, 6 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: (a) Monolithic ONNX graph (18,398 nodes) compiled to TensorRT. (b) Modular decomposition: three independently executable modules with the DPM-Solver++ loop in C++; the context encoder runs once and is cached across all denoising steps.
  • Figure 2: Closed-loop evaluation environment. (a) RViz visualization showing the HD map, point cloud, and planned trajectory within the Autoware stack. (b) AWSIM 3D simulation at an urban intersection in the Nishishinjuku map, with the ego vehicle and sensor perception rays.
  • Figure 3: Denoising progression of the 8s-horizon waypoint. (a) Per-step predicted position in normalized coordinates: longitudinal (blue) and lateral (orange); three phases are visible: coarse (steps 1–2), refinement (3–6), and fine adjustment (7–10). (b) Relative error to the converged step-10 prediction.
  • Figure 4: Latency–accuracy Pareto frontier. Horizontal gap: encoder caching benefit; vertical gap: solver order benefit. Dashed line: 100 ms planning budget.
  • Figure 5: Solver comparison. (a) FDE and (b) ADE vs. $N{=}10$, $p{=}2$ reference. Shaded: $\pm 1\sigma$.
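
The FDE and ADE metrics in Figure 5 can be illustrated with a short sketch. This assumes the conventional definitions (ADE is the mean Euclidean distance over all waypoints, FDE is the distance at the final waypoint) and treats a reference trajectory, such as the output of a higher-step-count configuration, as ground truth. The `Point` struct and the function names are hypothetical and are not taken from the released codebase.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point {
  double x;
  double y;
};

// Average displacement error: mean Euclidean distance over all waypoints.
double average_displacement_error(const std::vector<Point>& pred,
                                  const std::vector<Point>& ref) {
  assert(!pred.empty() && pred.size() == ref.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < pred.size(); ++i) {
    sum += std::hypot(pred[i].x - ref[i].x, pred[i].y - ref[i].y);
  }
  return sum / static_cast<double>(pred.size());
}

// Final displacement error: Euclidean distance at the last waypoint only.
double final_displacement_error(const std::vector<Point>& pred,
                                const std::vector<Point>& ref) {
  assert(!pred.empty() && pred.size() == ref.size());
  const Point& p = pred.back();
  const Point& r = ref.back();
  return std::hypot(p.x - r.x, p.y - r.y);
}
```

For a two-waypoint trajectory that matches the reference at the first waypoint and is offset by (3, 4) at the second, these functions return an ADE of 2.5 and an FDE of 5.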