Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang
TL;DR
This work investigates why multimodal in-context learning (ICL) with large vision-language models (LVLMs) is unstable and proposes a training-free solution, Context-Aware Modulated Attention (CAMA). The authors analyze two key attention deficits, visual-text grounding within each in-context demonstration (ICD) in shallow layers and cross-ICD routing in middle layers, and design a two-stage modulation that enhances relevant image tokens and prioritizes query-relevant ICDs without parameter updates. Across 11 benchmarks and four LVLMs, CAMA yields consistent accuracy gains, unlocks the intended benefits of prompt engineering, and generalizes to tasks beyond VQA. The approach advances understanding of attention dynamics in LVLMs and offers a scalable, plug-in method for improving multimodal ICL performance in real-world settings.
Abstract
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
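To make the idea of adjusting attention logits concrete, the following is a minimal toy sketch and not the CAMA method itself. It assumes a single query row of pre-softmax attention logits and a hypothetical set of key positions judged important (for example, image tokens); the function names, the fixed additive bias, and the `strength` parameter are illustrative assumptions, whereas the abstract describes a two-stage, input-dependent modulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_attention(logits, boosted_positions, strength=1.0):
    """Add a fixed bias to the pre-softmax logits of selected key positions.

    Hypothetical helper for illustration only: the selection rule and the
    constant bias stand in for whatever context-dependent rule is used.
    """
    boosted = set(boosted_positions)
    adjusted = [x + strength if i in boosted else x
                for i, x in enumerate(logits)]
    return softmax(adjusted)


# Toy example: four key tokens with equal logits; positions 1 and 2
# play the role of semantically important (e.g. visual) tokens.
logits = [0.0, 0.0, 0.0, 0.0]
base = softmax(logits)
mod = modulate_attention(logits, boosted_positions=[1, 2], strength=1.0)
```

Because the bias is applied before the softmax, the boosted positions gain attention mass at the expense of the others while the weights still sum to one, and no model parameters change.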
