
Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi

Abstract

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
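To make the two-step idea concrete, the sketch below illustrates one way the described mechanism could be realized: a first pass scores evidence tokens by the attention they receive from the question tokens, and the selected retrieved passages are then wrapped in prompt-level markers for a second generation pass. This is a minimal, non-authoritative sketch, not the authors' implementation; it assumes per-layer attention tensors of shape (num_heads, seq_len, seq_len) from a forward pass with attention outputs enabled, and the helper names and the `<evidence>` marker are illustrative.

```python
import torch

def select_relevant_tokens(attentions, query_idx, candidate_idx, top_k=5):
    """Score candidate evidence tokens (image patches or retrieved-passage tokens)
    by the attention they receive from the question tokens, averaged over layers
    and heads. This is an assumed scoring rule, not the paper's exact formula.

    attentions: list of per-layer tensors, each (num_heads, seq_len, seq_len)
    query_idx: sequence positions of the question tokens
    candidate_idx: sequence positions of the candidate evidence tokens
    """
    # (num_layers, num_heads, seq_len, seq_len) -> average over layers and heads
    attn = torch.stack(attentions).mean(dim=(0, 1))          # (seq_len, seq_len)
    # Mean attention mass flowing from the question tokens to each candidate
    scores = attn[query_idx][:, candidate_idx].mean(dim=0)   # (num_candidates,)
    top = scores.topk(min(top_k, len(candidate_idx))).indices
    return [candidate_idx[i] for i in top.tolist()]

def highlight_passages(passages, selected_ids):
    """Rebuild the textual context, wrapping the selected passages in lightweight
    markers so the model re-attends to them during the second generation pass.
    The <evidence> tag is a hypothetical marker choice."""
    marked = []
    for i, passage in enumerate(passages):
        if i in selected_ids:
            marked.append(f"<evidence> {passage} </evidence>")
        else:
            marked.append(passage)
    return "\n".join(marked)
```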


Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the proposed Look Twice (LoT). Visualization of the steps used to estimate which parts of the image and retrieved text are most relevant to the query.
  • Figure 2: Qualitative examples of our visual evidence selection pipeline (right). In the attention map (left), irrelevant visual tokens (orange) exhibit disproportionately high activations in specific hidden-state dimensions, while relevant tokens (blue) remain moderate. The BOS token, known to act as an attention sink in LLMs, shows a similar pattern. In LoT, tokens whose sink score exceeds a threshold $\tau$ are filtered (see the sketch after this figure list).
  • Figure 3: Qualitative examples of LoT highlighting query-relevant visual regions and textual evidence, enabling the model to generate the correct answer.
  • Figure 4: Performance on E-VQA as the number $n$ of retrieved passages varies (left) and with oracle Wikipedia evidence (right) for different MLLM backbones.
  • Figure 5: Qualitative examples of attention maps filtering.
  • ...and 2 more figures
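Building on the filtering rule described in the Figure 2 caption, the sketch below shows one plausible way to score and discard sink-like visual tokens before ranking regions by attention. The specific sink-score formula (peak hidden-state magnitude relative to a token's typical magnitude) and the value of $\tau$ are assumptions for illustration, not the definition used in the paper.

```python
import torch

def sink_scores(hidden_states):
    """Hypothetical sink score: how disproportionately large a token's activation
    is in its most extreme hidden-state dimensions, relative to its typical
    magnitude. hidden_states: (num_visual_tokens, hidden_dim)."""
    mags = hidden_states.abs()
    peak = mags.max(dim=-1).values                # most extreme dimension per token
    typical = mags.median(dim=-1).values + 1e-6   # typical per-token magnitude
    return peak / typical                         # large ratio -> sink-like token

def filter_sink_tokens(visual_token_ids, hidden_states, tau=20.0):
    """Drop visual tokens whose sink score exceeds tau, so that sink tokens
    (e.g., BOS-like attention sinks) cannot dominate the evidence selection.
    visual_token_ids must be aligned row-for-row with hidden_states;
    tau=20.0 is an illustrative value."""
    scores = sink_scores(hidden_states)
    keep = (scores <= tau).nonzero(as_tuple=True)[0].tolist()
    return [visual_token_ids[i] for i in keep]
```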