Mimic In-Context Learning for Multimodal Tasks
Yuchu Jiang, Jiale Fu, Chenduo Hao, Xinting Hu, Yingzhe Peng, Xin Geng, Xu Yang
TL;DR
This work reframes in-context learning (ICL) for multimodal transformers as learning stable, query-dependent shift effects injected into attention heads. By inserting a shift vector per attention head after the attention layers, deriving a query-dependent magnitude $\mu$, and enforcing a layer-wise alignment loss, MimIC generalizes better with far fewer demonstrations than traditional ICL and prior shift-vector-based methods. Experiments on VQAv2, OK-VQA, and COCO Caption with two open-source LMMs (Idefics-9b and Idefics2-8b-base) show consistent improvements over baselines, reduced hallucinations, and compatibility with LoRA for further gains. The approach mimics ICL data-efficiently, speeds up inference, and transfers to additional tasks, which makes it practical for scalable multimodal reasoning without extensive fine-tuning.
Abstract
Recently, in-context learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configuration of ICDs, motivating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with previous shift-vector-based methods, MimIC approximates the shift effects more closely by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making the shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, and COCO Caption) demonstrate that MimIC outperforms existing shift-vector-based methods. The code is available at https://github.com/Kamichanw/MimIC.
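The "shift vector" view can be checked numerically for a single attention head. The sketch below is not taken from the MimIC code base; it only verifies the standard softmax identity that attention over ICD tokens plus query tokens equals zero-shot attention plus a shift whose magnitude $\mu$ depends on the query. All function and variable names are illustrative assumptions, and the scores are unscaled dot products for brevity. MimIC, as described above, learns such shift effects with lightweight modules rather than computing them from actual ICDs.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(q, keys, values):
    """Single-head softmax attention output for one query vector."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def log_partition(q, keys):
    """log(sum_i exp(q . k_i)), computed stably."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def shifted_attention(q, k_query, v_query, k_icd, v_icd):
    """Zero-shot attention plus a query-dependent shift from the ICDs."""
    base = softmax_attention(q, k_query, v_query)   # no demonstrations
    icd_out = softmax_attention(q, k_icd, v_icd)    # demonstrations only
    # mu = Z_icd / (Z_icd + Z_query): the share of attention mass on ICDs.
    mu = 1.0 / (1.0 + math.exp(log_partition(q, k_query)
                               - log_partition(q, k_icd)))
    shift = [mu * (d - b) for d, b in zip(icd_out, base)]
    return [b + s for b, s in zip(base, shift)], mu


random.seed(0)
DIM = 4


def rand_vecs(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


q = rand_vecs(1)[0]
k_icd, v_icd = rand_vecs(6), rand_vecs(6)        # hypothetical ICD tokens
k_query, v_query = rand_vecs(3), rand_vecs(3)    # hypothetical query tokens

# Attention with the ICDs placed in context, as in ordinary ICL.
full = softmax_attention(q, k_icd + k_query, v_icd + v_query)
# The same output, rebuilt as zero-shot attention plus a shift.
rebuilt, mu = shifted_attention(q, k_query, v_query, k_icd, v_icd)
max_err = max(abs(a - b) for a, b in zip(full, rebuilt))
print(f"mu = {mu:.4f}, max abs error = {max_err:.2e}")
```

Because $\mu$ is a ratio of softmax normalizers that both depend on the query vector, a fixed scalar magnitude cannot reproduce it in general, which is consistent with the abstract's emphasis on a query-dependent shift magnitude.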
