Mimic In-Context Learning for Multimodal Tasks

Yuchu Jiang, Jiale Fu, Chenduo Hao, Xinting Hu, Yingzhe Peng, Xin Geng, Xu Yang

TL;DR

This work reframes in-context learning for multimodal transformers as learning stable, query-dependent shift effects injected into attention heads. By inserting per-head shift vectors, deriving a query-dependent magnitude $\mu$, and enforcing a layer-wise alignment loss, MimIC achieves stronger generalization with far fewer demonstrations than traditional ICL and prior shift-based methods. Empirical results across VQAv2, OK-VQA, and COCO Caption on two open-source LMMs (Idefics-9b and Idefics2-8b-base) show consistent improvements over baselines, reduced hallucinations, and compatibility with LoRA for further gains. The approach demonstrates data-efficient ICL mimicry, speedups at inference, and broad transfer to additional tasks, highlighting practical impact for scalable multimodal reasoning without extensive fine-tuning.

Abstract

Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://github.com/Kamichanw/MimIC.
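
The statement that ICDs act as "shift vectors" can be checked numerically for a single attention head. The sketch below is illustrative only and is not taken from the authors' code: it uses pure Python, omits the usual $1/\sqrt{d}$ scaling, and the names `attend` and `shift_decomposition` are hypothetical. It shows that attending over demonstration and query tokens together equals the query-only attention output plus a shift whose magnitude `mu` depends on the query.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head softmax attention of one query vector over (keys, values)."""
    weights = softmax([dot(q, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def shift_decomposition(q, K_q, V_q, K_d, V_d):
    """Rewrite attention over [demos; query tokens] as query-only attention plus a shift.

    mu is the share of softmax mass that falls on the demonstration tokens,
    so it changes with the query vector q.
    """
    z_demo = sum(math.exp(dot(q, k)) for k in K_d)
    z_query = sum(math.exp(dot(q, k)) for k in K_q)
    mu = z_demo / (z_demo + z_query)
    a_q = attend(q, K_q, V_q)  # output without demonstrations
    a_d = attend(q, K_d, V_d)  # output over demonstrations only
    shifted = [aq + mu * (ad - aq) for aq, ad in zip(a_q, a_d)]
    return mu, shifted


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


d = 4
q = rand_matrix(1, d)[0]
K_q, V_q = rand_matrix(3, d), rand_matrix(3, d)  # query-side tokens
K_d, V_d = rand_matrix(5, d), rand_matrix(5, d)  # demonstration tokens

full = attend(q, K_d + K_q, V_d + V_q)  # conventional ICL: attend over everything
mu, shifted = shift_decomposition(q, K_q, V_q, K_d, V_d)
max_gap = max(abs(a - b) for a, b in zip(full, shifted))
print(f"mu = {mu:.4f}, max |full - shifted| = {max_gap:.2e}")
```

The two outputs agree up to floating-point error. Under this reading, enhancements 2) and 3) in the abstract amount to replacing the demonstration-dependent difference `a_d - a_q` with a learnable per-head vector and estimating `mu` from the query alone; how MimIC parameterizes either quantity is not specified in this excerpt.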

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Sketches of shift effects from query space to answer space. (a) Traditional ICL induces the shift vector from ICDs, which is sensitive to ICD configurations: changing one ICD can make the prediction incorrect. (b) Previous shift vector-based methods insert a query-independent shift vector learned from a large training set, applying the same shift magnitude to diverse queries, which may make the prediction incorrect. (c) MimIC assigns a unique query-dependent shift vector, learned from fewer training samples, after each attention head layer, applying a different shift magnitude to each query and thus achieving stronger generalization.
  • Figure 2: Comparison of MimIC and previous shift vector-based methods. (a) MimIC changes the attention mechanism of each head by inserting a learnable shift vector $\bm{v}$ with a query-dependent magnitude $\mu$. (b) Previous methods insert a pre-calculated or learnable shift vector with a query-independent $\mu$ after the FFN layer, without changing the attention mechanism.
  • Figure 3: Overall training framework of MimIC. (a) The original LMM processes $k$ ICDs and the query input as in conventional ICL, generating hidden states $\bm{H}_1^\prime$ to $\bm{H}_N^\prime$ at each layer. (b) The MimIC LMM processes only a single query input $\bm{X}$, producing shifted hidden states $\bm{H}_1$ to $\bm{H}_N$, which are aligned with the original hidden states via the alignment loss $\mathcal{L}_\text{align}$. Additionally, the logits of the language modeling head are used to compute the ground-truth loss $\mathcal{L}_\text{gt}$. The yellow blocks represent MimIC attention heads.
  • Figure 4: Performance comparisons of trainable methods on two LMMs across VQAv2/OK-VQA with smaller training set sizes.
  • Figure 5: Performance of MimIC trained with varying ICD shots on Idefics-9b, with the shaded area indicating the standard deviation across 1, 4, 8, 16 and 32 shot settings.
  • ...and 2 more figures
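
The training setup described in the Figure 3 caption can be sketched as a two-term objective. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the squared-distance form of the alignment term, the averaging over tokens and layers, the weighting factor `gt_weight`, and all function names are hypothetical choices made for illustration.

```python
import math


def alignment_loss(shifted_states, icl_states):
    """Mean over layers of the mean squared distance between per-token hidden states.

    shifted_states[i] stands in for H_i (MimIC model, query-only input);
    icl_states[i] stands in for H'_i (original model with ICDs), restricted
    to the same query token positions. Both are lists of token vectors.
    """
    per_layer = []
    for H, H_ref in zip(shifted_states, icl_states):
        sq = [sum((a - b) ** 2 for a, b in zip(h, h_ref)) for h, h_ref in zip(H, H_ref)]
        per_layer.append(sum(sq) / len(sq))
    return sum(per_layer) / len(per_layer)


def cross_entropy(logits, target_index):
    """Ground-truth loss for one token: negative log-softmax at the target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def total_loss(shifted_states, icl_states, logits, target_index, gt_weight=0.5):
    """Assumed combination: L_align + gt_weight * L_gt."""
    return alignment_loss(shifted_states, icl_states) + gt_weight * cross_entropy(
        logits, target_index
    )


# Toy example: 2 layers, 2 query tokens, hidden size 2.
icl = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
shifted = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 2.0]]]
loss = total_loss(shifted, icl, logits=[0.0, 0.0], target_index=0)
print(f"L_align = {alignment_loss(shifted, icl):.3f}, total = {loss:.3f}")
```

In this toy input only one token in the second layer differs, so the alignment term is small, and the ground-truth term reduces to $\log 2$ for two equal logits. In a real setup, only the lightweight MimIC modules would receive gradients while the reference hidden states come from a frozen forward pass with ICDs.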