HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
Yingqi Tang, Zhuoran Xu, Zhaotie Meng, Erkang Cheng
TL;DR
HiP-AD addresses the gap between open-loop planning metrics and closed-loop driving performance by introducing hierarchical, multi-granularity planning queries and a planning deformable attention mechanism within a unified decoder that jointly handles perception, prediction, and planning. The approach provides a rich planning representation (temporal, spatial, and driving-style waypoints) and uses geometric priors to sample image features near the planned trajectory, improving interaction with both BEV and perspective views. Empirical results on Bench2Drive show state-of-the-art closed-loop performance, with competitive results on nuScenes, and ablations support the benefits of multi-granularity planning and planning-driven feature retrieval. This work advances end-to-end autonomous driving by enhancing planning supervision, perception-planning interaction, and trajectory control in a single, differentiable framework.
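To make the distinction between temporal and spatial waypoints concrete, here is a minimal NumPy sketch. It is an illustration, not the authors' implementation: it assumes temporal waypoints are taken at fixed time intervals and spatial waypoints at fixed arc-length intervals along the same dense trajectory. The function names, the `dt`, `step_s`, and `step_m` parameters, and the input layout are all hypothetical. Driving-style waypoints are not covered here.

```python
import numpy as np

def temporal_waypoints(traj_xy, dt, step_s, n):
    """Resample a dense trajectory at fixed time intervals.

    traj_xy: (T, 2) ego positions sampled every `dt` seconds, with traj_xy[0] at t = 0.
    Returns (n, 2) positions at t = step_s, 2 * step_s, ..., n * step_s.
    """
    times = np.arange(len(traj_xy)) * dt
    query = step_s * np.arange(1, n + 1)
    # np.interp clamps queries beyond the last sample to the final position.
    x = np.interp(query, times, traj_xy[:, 0])
    y = np.interp(query, times, traj_xy[:, 1])
    return np.stack([x, y], axis=1)

def spatial_waypoints(traj_xy, step_m, n):
    """Resample the same trajectory at fixed distance intervals along the path.

    Assumes the vehicle keeps moving, so cumulative arc length is increasing.
    Returns (n, 2) positions at arc length step_m, 2 * step_m, ..., n * step_m.
    """
    seg = np.linalg.norm(np.diff(traj_xy, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    query = step_m * np.arange(1, n + 1)
    x = np.interp(query, arc, traj_xy[:, 0])
    y = np.interp(query, arc, traj_xy[:, 1])
    return np.stack([x, y], axis=1)
```

For an accelerating vehicle the two patterns diverge: temporal waypoints bunch up near the start, where speed is low, while spatial waypoints stay evenly spaced along the path regardless of speed.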
Abstract
Although end-to-end autonomous driving (E2E-AD) technologies have made significant progress in recent years, their performance in closed-loop evaluation remains unsatisfactory. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction, improving the precision of closed-loop control for the ego vehicle. Additionally, we explicitly use the geometric properties of planning trajectories to retrieve relevant image features based on physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes.
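The idea of retrieving image features at the physical locations of planned waypoints can be sketched as follows. This is a simplified illustration under stated assumptions, not the HiP-AD implementation: it uses a single pinhole camera, places waypoints on the ground plane (z = 0 in the ego frame), and replaces the learned offsets and attention weights of deformable attention with plain bilinear sampling at the projected points. All function names and the `stride` parameter are hypothetical.

```python
import numpy as np

def project_to_image(points_ego, K, T_cam_from_ego):
    """Project (N, 3) ego-frame points to pixel coordinates.

    K: (3, 3) pinhole intrinsics. T_cam_from_ego: (4, 4) rigid transform.
    Returns (uv, valid), where valid marks points in front of the camera.
    """
    ones = np.ones((len(points_ego), 1))
    pts_cam = (T_cam_from_ego @ np.concatenate([points_ego, ones], axis=1).T).T[:, :3]
    depth = pts_cam[:, 2]
    valid = depth > 1e-3
    uvw = (K @ pts_cam.T).T
    # Avoid dividing by non-positive depth; those points are masked out later.
    uv = uvw[:, :2] / np.where(valid, depth, 1.0)[:, None]
    return uv, valid

def bilinear_sample(feat, uv):
    """Bilinearly sample a (C, H, W) feature map at (N, 2) pixel coords (u = col, v = row).

    Returns (N, C); points outside the map yield zeros.
    """
    _, H, W = feat.shape
    u, v = uv[:, 0], uv[:, 1]
    inside = (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    u, v = np.clip(u, 0, W - 1), np.clip(v, 0, H - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    top = feat[:, v0, u0] * (1 - du) + feat[:, v0, u1] * du  # (C, N)
    bot = feat[:, v1, u0] * (1 - du) + feat[:, v1, u1] * du
    return ((top * (1 - dv) + bot * dv).T) * inside[:, None]

def sample_trajectory_features(feat, waypoints_xy, K, T_cam_from_ego, stride=1.0):
    """Gather one feature vector per planned waypoint.

    waypoints_xy: (N, 2) ego-frame waypoints, assumed to lie on the ground plane.
    stride: downsampling factor of the feature map relative to the image.
    """
    pts = np.concatenate([waypoints_xy, np.zeros((len(waypoints_xy), 1))], axis=1)
    uv, valid = project_to_image(pts, K, T_cam_from_ego)
    return bilinear_sample(feat, uv / stride) * valid[:, None]
```

In a full deformable attention layer, each waypoint would act as a reference point around which several learned offsets are sampled and combined with learned weights, typically across multiple cameras and feature scales. The sketch keeps only the geometric step: the trajectory itself decides where in the image to look.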
