Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning

Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang

TL;DR

This work investigates why multimodal in-context learning (ICL) with large vision-language models (LVLMs) is unstable and proposes a training-free solution, Context-Aware Modulated Attention (CAMA). By analyzing two key attention deficits—intra-ICD visual-text grounding in shallow layers and cross-ICD routing in middle layers—the authors design a two-stage modulation that enhances relevant image tokens and prioritizes query-relevant ICDs without parameter updates. Across 11 benchmarks and four LVLMs, CAMA yields consistent accuracy gains, activates benefits of prompt engineering, and generalizes to tasks beyond VQA, demonstrating robust, practical improvements in multimodal reasoning. The approach advances understanding of attention dynamics in LVLMs and offers a scalable, plug-in method to boost multimodal ICL performance in real-world settings.

Abstract

Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
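The abstract describes CAMA as adjusting attention logits before the softmax so that semantically important tokens receive more attention. The snippet below is only a generic sketch of that idea, not the authors' formulation: the function name `modulated_attention`, the boolean `boost_mask` marking which key tokens to strengthen, and the scalar `gamma` are all hypothetical, and the rule CAMA actually uses to select tokens and set the strength is not given in this section.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def modulated_attention(logits, boost_mask, gamma=1.0):
    """Add a constant bias to the logits of selected key tokens, then normalize.

    logits:     (num_queries, num_keys) pre-softmax attention scores.
    boost_mask: (num_keys,) boolean array; True marks keys to strengthen.
                How these keys are chosen is an assumption left open here.
    gamma:      hypothetical modulation strength (no parameters are trained).
    """
    bias = gamma * boost_mask.astype(logits.dtype)
    return softmax(logits + bias[None, :], axis=-1)


# Toy example: 4 query tokens attending over 10 key tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))

# Pretend keys 2..4 are the "important" (e.g. visual) tokens.
boost_mask = np.zeros(10, dtype=bool)
boost_mask[2:5] = True

plain = softmax(logits)
boosted = modulated_attention(logits, boost_mask, gamma=1.5)

# Attention mass on the selected keys, before and after modulation.
print(plain[:, boost_mask].sum(axis=-1))
print(boosted[:, boost_mask].sum(axis=-1))
```

Because the bias is added to the logits rather than to the normalized weights, each row still sums to one, and the extra mass on the selected keys is taken proportionally from the remaining keys. No model weights change, which is consistent with the training-free, plug-and-play description above.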

Paper Structure

This paper contains 31 sections, 16 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: (a) Example of a 3-shot multimodal in-context sequence. (b)-(d) present the vanilla model, adding an instruction to the sequence, and our proposed method, CAMA, respectively. All attention heatmaps come from layer 18, and redder regions indicate stronger attention.
  • Figure 2: Layer-wise trends of the intra-ICD alignment score $s_{align}$ and the key ICD contribution score $s_{contrib}$ in effective and ineffective multimodal ICL. Pos 1, 2, and 3 denote the key ICD position in the sequence.
  • Figure 3: An overview pipeline of CAMA. A version with more details is provided in Appendix 1.
  • Figure 4: Average performance of CAMA across four LVLMs and seven VQA benchmarks as the count of ICDs and sequence configuration strategies vary.
  • Figure 5: Average performance of CAMA on seven VQA benchmarks while varying $L_\text{stageI}$, $L_\text{stageII}$, $k_\text{I}$, and $k_\text{II}$. Random7 means randomly choosing seven layers.
  • ...and 3 more figures