Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Hao Zhou, Zhanning Gao, Zhili Chen, Maosheng Ye, Qifeng Chen, Tongyi Cao, Honggang Qi

TL;DR

HoP tackles the shortfall of general multimodal LLMs in autonomous driving by introducing three hierarchical hints—Affinity, Semantic, and Question—that enrich visual representations and align them with driving-specific queries. A simple Hint Fusion module fuses these hints with CLIP visual tokens, enabling rapid domain adaptation with limited data and an adapter-LLM pipeline. An Efficient HoP variant further reduces latency by distilling lightweight hint-heads without sacrificing accuracy. Experiments on LingoQA, DRAMA, and BDD-X demonstrate state-of-the-art performance and strong data efficiency, validating the practicality of multi-level hints for safety-critical VQA in autonomous driving.

Abstract

In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
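To make "token-wise connections" concrete, the following is a minimal sketch of one common way to measure affinity between visual tokens: a pairwise cosine-similarity matrix. It is an illustration under stated assumptions, not the authors' implementation. The function name `cosine_affinity` and the toy token values are hypothetical, and the abstract does not say how the Affinity hint is actually computed.

```python
import math


def cosine_affinity(tokens):
    """Return the N x N cosine-similarity matrix for a list of token vectors.

    Hypothetical helper for illustration; not taken from the HoP paper.
    """
    normed = []
    for vec in tokens:
        norm = math.sqrt(sum(x * x for x in vec))
        # Guard against zero vectors so the division is always defined.
        normed.append([x / norm if norm > 0 else 0.0 for x in vec])
    # Entry (i, j) is the dot product of unit vectors i and j.
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


# Toy example: three 2-D "patch tokens"; the first two point the same way,
# so they get a high affinity, while the third is orthogonal to both.
tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
affinity = cosine_affinity(tokens)
```

Tokens that belong to the same instance would be expected to show high mutual affinity in such a matrix, which is the kind of instance-level structure the Affinity hint is described as emphasizing.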

Paper Structure

This paper contains 13 sections, 5 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of the Hints of Prompt (HoP) Framework. HoP enhances Visual Question Answering (VQA) in autonomous driving by incorporating three hierarchical hints: Affinity, Semantic, and Question. The Affinity hint provides foundational instance-level structures through token-wise connections, aiding in instance boundary and interaction recognition. Building on this, the Semantic hint introduces specific instances along with their category information, adding essential driving-related contexts, such as vehicles and traffic signs. Finally, the Question hint guides the LLM’s attention toward image regions pertinent to the question. These hints are fused with visual tokens through a simple Hint Fusion module, aligned via an adapter, and then processed by the LLM to generate answers.
  • Figure 2: Performance under different training data ratios on the LingoQA dataset. HoP surpasses the full-data performance of LLaVA-v1.5 using only 25% of the training data.
  • Figure 3: Visualization of token affinity from CLIP and DINOv2. Similar colors indicate higher affinity scores, with color values derived from a PCA-reduced token vector space. DINOv2 token-wise similarity denotes tokens embedded only from the similarity matrix.
  • Figure 4: The impact of the three proposed hint types compared to the baseline and attention maps with/without these hints.
  • Figure 5: Different fusion strategies for the Hint Fusion module. The joint cross-attention strategy demonstrates the best performance. Residual connections are omitted for simplicity.
  • ...and 3 more figures
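
The captions for Figures 1 and 5 describe fusing the three hints with visual tokens through joint cross-attention. The sketch below shows one plausible reading of that idea: each visual token acts as a query over the concatenation of all hint tokens, and the attended result is added back to the visual token. This is an assumption-laden illustration, not the paper's module. It uses a single head, has no learned projections, and the names `joint_cross_attention` and `hint_groups` are hypothetical. The residual addition is also an assumption, based only on the Figure 5 note that residual connections exist but are not drawn.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_cross_attention(visual, hint_groups):
    """Fuse visual tokens with hint tokens (hypothetical, unparameterized sketch).

    visual:      list of token vectors, each of length dim
    hint_groups: list of token lists (e.g. affinity, semantic, question hints),
                 all with the same dim as the visual tokens
    """
    # "Joint" here means all hint types share one key/value sequence.
    hints = [tok for group in hint_groups for tok in group]
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual:
        # Scaled dot-product scores of this visual token against every hint token.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in hints]
        weights = softmax(scores)
        # Weighted sum of hint tokens (values are the hint tokens themselves).
        attended = [sum(w * h[i] for w, h in zip(weights, hints)) for i in range(dim)]
        # Residual connection: keep the original visual token and add the hint signal.
        fused.append([x + a for x, a in zip(query, attended)])
    return fused


# Toy example with two visual tokens and one token per hint type.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
affinity_hint = [[0.5, 0.5]]
semantic_hint = [[1.0, 0.0]]
question_hint = [[0.0, 1.0]]
fused_tokens = joint_cross_attention(
    visual_tokens, [affinity_hint, semantic_hint, question_hint]
)
```

The output has one fused token per visual token, so the sequence length passed on to the adapter and LLM is unchanged. A real module would add learned query, key, and value projections and likely multiple heads.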