Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Hao Zhou, Zhanning Gao, Zhili Chen, Maosheng Ye, Qifeng Chen, Tongyi Cao, Honggang Qi
TL;DR
HoP addresses the shortfall of general multimodal LLMs in autonomous driving by introducing three hierarchical hints (Affinity, Semantic, and Question) that enrich visual representations and align them with driving-specific queries. A simple Hint Fusion module combines these hints with CLIP visual tokens within an adapter-LLM pipeline, enabling rapid domain adaptation with limited data. An Efficient HoP variant further reduces latency by distilling lightweight hint-heads without sacrificing accuracy. Experiments on LingoQA, DRAMA, and BDD-X demonstrate state-of-the-art performance and strong data efficiency, supporting the practicality of multi-level hints for safety-critical VQA in autonomous driving.
Abstract
Given the dynamic nature of autonomous driving environments and their stringent safety requirements, general MLLMs that rely on CLIP alone often struggle to represent driving-specific scenarios accurately, particularly complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: an Affinity hint that emphasizes instance-level structure by strengthening token-wise connections; a Semantic hint that incorporates high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs; and a Question hint that aligns visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, which enriches the visual representation with driving-related information learned from limited domain data and enables faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods on all key metrics.
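To make the data flow concrete, the following is a minimal sketch of how three hints might be combined with CLIP visual tokens. It is illustrative only: the abstract does not specify how each hint is computed or which fusion operator is used. The sketch assumes (hypothetically) that the Affinity hint is a similarity-weighted average over tokens, that the Question hint rescales tokens by their relevance to a question embedding, that the Semantic hint arrives as precomputed per-token features, and that fusion is per-token concatenation. All function names are invented for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Unit-normalize a vector; guard against a zero norm.
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def affinity_hint(tokens):
    # ASSUMPTION: each token becomes a softmax(cosine-similarity)-weighted
    # average of all tokens, strengthening token-wise connections.
    unit = [normalize(t) for t in tokens]
    dim = len(tokens[0])
    out = []
    for u in unit:
        w = softmax([dot(u, v) for v in unit])
        out.append([sum(wj * t[d] for wj, t in zip(w, tokens)) for d in range(dim)])
    return out

def question_hint(tokens, question):
    # ASSUMPTION: scale each token by its softmax relevance to the question.
    q = normalize(question)
    w = softmax([dot(normalize(t), q) for t in tokens])
    return [[wi * x for x in t] for wi, t in zip(w, tokens)]

def fuse_hints(clip_tokens, semantic_tokens, question):
    # ASSUMPTION: fusion is per-token concatenation of the CLIP token with
    # its three hints, so each fused token has 4x the original width.
    aff = affinity_hint(clip_tokens)
    qst = question_hint(clip_tokens, question)
    return [c + a + s + q
            for c, a, s, q in zip(clip_tokens, aff, semantic_tokens, qst)]

# Toy usage: three 2-D "visual tokens", placeholder semantic features,
# and a 2-D question embedding.
clip_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic_tokens = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
question = [1.0, 0.0]
fused = fuse_hints(clip_tokens, semantic_tokens, question)
print(len(fused), len(fused[0]))  # 3 tokens, each of width 8
```

In a real system the fused tokens would then be passed to the adapter and LLM; a learned projection or attention-based fusion could replace the concatenation used here.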
