HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder

Yingqi Tang, Zhuoran Xu, Zhaotie Meng, Erkang Cheng

TL;DR

HiP-AD addresses the gap between open-loop planning metrics and closed-loop driving performance by introducing hierarchical, multi-granularity planning queries and a planning deformable attention mechanism within a unified decoder that jointly handles perception, prediction, and planning. The approach enables rich planning representation (temporal, spatial, driving-style) and leverages geometric priors to sample image features near the planned trajectory, improving interaction with BEV and perspective views. Empirical results on Bench2Drive show state-of-the-art closed-loop performance, with competitive results on nuScenes, supported by extensive ablations that validate the benefits of multi-granularity planning and planning-driven feature retrieval. This work advances end-to-end autonomous driving by enhancing planning supervision, perception-planning interaction, and trajectory control in a single, differentiable framework.

Abstract

Although end-to-end autonomous driving (E2E-AD) has made significant progress in recent years, its performance in closed-loop evaluation remains unsatisfactory. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction and improves the precision of closed-loop control of the ego vehicle. Additionally, we explicitly use the geometric properties of planning trajectories to retrieve relevant image features at their physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes.
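
The distinction between temporal and spatial waypoints at several granularities can be illustrated with a small sketch. This is not the authors' implementation: the function names, the assumption of a dense ego trajectory sampled at a fixed time step `dt`, and the example granularity values are all hypothetical, and driving-style waypoints are left out because the abstract does not say how they are constructed. Temporal waypoints are taken at fixed time steps, while spatial waypoints are interpolated at fixed travelled-distance intervals.

```python
import math


def temporal_waypoints(traj, dt, step, horizon):
    """Pick waypoints at times step, 2*step, ..., horizon.

    Assumption: traj[k] is the (x, y) ego position at time k * dt,
    with traj[0] being the current position.
    """
    stride = round(step / dt)       # dense samples per waypoint
    count = round(horizon / step)   # number of waypoints to return
    return [traj[i * stride] for i in range(1, count + 1)]


def spatial_waypoints(traj, interval, count):
    """Interpolate up to `count` waypoints every `interval` metres of travel.

    Fewer than `count` points are returned if the trajectory is too short.
    """
    out = []
    target = interval    # travelled distance of the next waypoint
    travelled = 0.0      # distance covered before the current segment
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        while seg > 0 and travelled + seg >= target and len(out) < count:
            r = (target - travelled) / seg  # fraction along this segment
            out.append((x0 + r * (x1 - x0), y0 + r * (y1 - y0)))
            target += interval
        travelled += seg
    return out


def multi_granularity(traj, dt, horizon, temporal_steps, spatial_intervals, n_spatial):
    """Collect waypoint sets for every requested granularity (illustrative only)."""
    return {
        "temporal": {s: temporal_waypoints(traj, dt, s, horizon) for s in temporal_steps},
        "spatial": {d: spatial_waypoints(traj, d, n_spatial) for d in spatial_intervals},
    }


# Hypothetical example: straight-line motion at 10 m/s, sampled every 0.1 s for 3 s.
demo_traj = [(float(k), 0.0) for k in range(31)]
demo = multi_granularity(
    demo_traj, dt=0.1, horizon=3.0,
    temporal_steps=[0.5, 1.0],     # illustrative values, not from the paper
    spatial_intervals=[2.0, 5.0],  # illustrative values, not from the paper
    n_spatial=4,
)
```

In this toy setup a slower vehicle would produce temporal waypoints that bunch closer together, while the spatial waypoints would keep the same spacing, which is one intuition for why the two types carry complementary supervision signals.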

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Comparison of existing state-of-the-art works on the open-loop Collision Rate metric (nuScenes dataset) and the closed-loop Success Rate metric (Bench2Drive dataset), where top left is better. The legend indicates the different planning interaction methods.
  • Figure 1: Illustration of open-loop results on the Bench2Drive validation dataset.
  • Figure 2: This diagram compares earlier methods (a-b) for predicting waypoints with our proposed multi-granularity planning design (c), where $n_t$, $n_s$, and $n_d$ denote the number of granularities for each waypoint type, defined in terms of frequency, interval, and speed, respectively. Part (d) illustrates the evolution of hierarchical waypoints with instantiated granularities based on different sampling strategies.
  • Figure 2: Illustration of open-loop results on the nuScenes validation dataset.
  • Figure 3: The overall framework of HiP-AD. It consists of a backbone and an FPN for extracting image features, a unified decoder for iteratively updating queries, and various heads for multi-task prediction. The inputs of the unified decoder are task anchors and queries (agent, map, and planning), where the planning query consists of multi-granularity waypoint representations. In each unified decoder layer, the task queries first interact with the temporal queries separately, then collaboratively with each other, and finally with the image features, in an iterative manner. Finally, the updated task queries are sent to the corresponding heads for perception, prediction, and planning; the planning results include waypoints at different granularities for precise action control.
  • ...and 4 more figures