Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving

Kairui Ding; Boyuan Chen; Yuchen Su; Huan-ang Gao; Bu Jin; Chonghao Sima; Wuqiang Zhang; Xiaohui Li; Paul Barsch; Hongyang Li; Hao Zhao

TL;DR

Hint-AD tackles interpretability in end-to-end autonomous driving by grounding language generation in the full perception-prediction-planning pipeline, rather than relying on declarative, post-hoc explanations. It introduces a holistic token mixer, instance-level fusion, and a barbell adapter-based language decoder to align outputs with intermediate AD representations, and augments training with an online alignment task across counting, position, motion, and planning. The approach is validated on UniAD and VAD backbones, achieving state-of-the-art results on driving-language tasks (Nu-X CIDEr, TOD3Cap CIDEr) and improving QA and command accuracy, with the Nu-X dataset providing a large-scale, human-labeled driving-explanation resource. Together, these contributions advance interpretable, reliable human-AI interaction in autonomous driving and establish a path toward integrating aligned language models with complex perception-planning pipelines.

Abstract

End-to-end architectures in autonomous driving (AD) face a significant challenge in interpretability, impeding human-AI trust. Human-friendly natural language has been explored for tasks such as driving explanation and 3D captioning. However, previous work has primarily focused on the paradigm of declarative interpretability, where the natural language interpretations are not grounded in the intermediate outputs of AD systems, making the interpretations only declarative. In contrast, aligned interpretability establishes a connection between language and the intermediate outputs of AD systems. Here we introduce Hint-AD, an integrated AD-language system that generates language aligned with the holistic perception-prediction-planning outputs of the AD model. By incorporating these intermediate outputs and a holistic token mixer sub-network for effective feature adaptation, Hint-AD achieves state-of-the-art results on driving language tasks including driving explanation, 3D dense captioning, and command prediction. To facilitate further study of the driving explanation task on nuScenes, we also introduce a human-labeled dataset, Nu-X. Code, dataset, and models will be publicly available.

Paper Structure (42 sections, 3 equations, 11 figures, 11 tables)

Introduction
Related Works
Methodology
Overall framework of Hint-AD
Holistic token mixer
Language decoder with barbell adaptation
Aligning language and intermediate outputs
Training pipeline
Experiments
Datasets and baselines
Datasets.
Baselines.
Comparing with baseline models
Quantitative results.
Qualitative results.
...and 27 more sections

Figures (11)

Figure 1: Illustration of two paradigms for interpretability of end-to-end autonomous driving (AD) systems through natural language. (a) Declarative interpretability does not utilize intermediate outputs from AD systems, resulting in text that merely justifies the car's driving behavior; (b) Aligned interpretability incorporates intermediate outputs from the AD model to align the generated language with the holistic perception-prediction-planning process.
Figure 2: Framework of Hint-AD. (a) Hint-AD pipeline illustration. Taking intermediate output tokens from an AD pipeline as input, a language decoder generates natural language responses. A holistic token mixer module is designed to adapt the tokens. (b) Detailed illustration of the BEV block architecture. (c) Detailed illustration of the instance block architecture.
Figure 3: Qualitative Results. We present examples of the language output generated by Hint-AD across multiple tasks, including driving explanation, 3D dense captioning, VQA, command prediction, and four categories of alignment tasks. Captions that do not match the ground truth are colored in red.
Figure 4: Qualitative results on the Nu-X and Command datasets. We choose TOD$^3$Cap as the baseline model and also present GPT-4o and Gemini-1.5 results.
Figure 5: Qualitative results on NuScenes-QA with GPT-4o and Gemini-1.5 outputs.
...and 6 more figures
