Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai

TL;DR

This work tackles the challenges of many-shot multimodal in-context learning by avoiding longer inputs and parameter updates. It introduces STV, a two-stage framework that first identifies context-sensitive insertion points within attention heads via activation deltas, then uses reinforcement learning to pick task vectors from a per-location activation bank for insertion. Empirical results across five vision-language benchmarks and two large multimodal model families show that STV consistently outperforms prior task-vector methods like MTV while drastically reducing insertion-search cost and preserving generalization. The approach offers a scalable, efficient pathway to leverage large multimodal models for many-shot ICL without finetuning or token-heavy prompts.

Abstract

Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods insert compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) that determines both where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. At the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
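
To make the "where" step concrete, the following is a minimal sketch of ranking attention heads by activation delta. It is an illustration under stated assumptions, not the authors' implementation: the function names `head_sensitivity` and `top_k_locations`, the nested-list layout `[sample][layer][head]`, and the use of the mean L2 norm as the sensitivity score are all hypothetical choices. The abstract only states that query-context activations are contrasted with query-only activations.

```python
import math


def head_sensitivity(ctx_acts, query_acts):
    """Average L2 norm of the activation delta for every (layer, head).

    Both inputs are nested lists indexed [sample][layer][head] -> vector,
    holding one head's output for the same query with context (ctx_acts)
    and without context (query_acts). The L2-norm score is an assumption.
    """
    n_samples = len(ctx_acts)
    n_layers = len(ctx_acts[0])
    n_heads = len(ctx_acts[0][0])
    scores = [[0.0] * n_heads for _ in range(n_layers)]
    for s in range(n_samples):
        for layer in range(n_layers):
            for head in range(n_heads):
                with_ctx = ctx_acts[s][layer][head]
                without_ctx = query_acts[s][layer][head]
                # Activation delta: query-context minus query-only.
                delta = [c - q for c, q in zip(with_ctx, without_ctx)]
                norm = math.sqrt(sum(d * d for d in delta))
                scores[layer][head] += norm / n_samples
    return scores


def top_k_locations(scores, k):
    """Return the k (layer, head) pairs with the largest sensitivity."""
    flat = [
        (score, layer, head)
        for layer, row in enumerate(scores)
        for head, score in enumerate(row)
    ]
    flat.sort(key=lambda item: item[0], reverse=True)
    return [(layer, head) for _, layer, head in flat[:k]]
```

In practice the activations would be collected with forward hooks on the model's attention layers; that part is omitted here because it depends on the specific model implementation.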

Paper Structure

This paper contains 13 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of previous task-vector-based methods and our sensitivity-aware method that systematically determines both insertion locations and task vector values.
  • Figure 2: Attention head sensitivity across datasets (VizWiz and OK-VQA), models (Qwen-VL-7B and Idefics2-8B), and sample sizes (100 vs. 500). Context-sensitive locations (black boxes) consistently emerge within tasks, validating the stability and structural patterns of the activation deltas, which are computed by contrasting query–context activations with query-only activations.
  • Figure 3: Overview of the STV framework. It consists of two stages: (1) Sensitivity-aware location identification, where we compare query–context activations with query-only activations to compute activation deltas and identify attention heads that consistently respond to contextual information. (2) Task vector selection, where candidate task vectors are drawn from a pre-computed activation bank and reinforcement learning is used to choose the most suitable one at each identified location.
  • Figure 4: (a) Effect of cluster granularity on task vector selection. (b) Impact of top-k location selection on model performance. (c) Impact of the number of iterations $T$ and the number of shots per iteration on model performance. All results use Qwen-VL on the VizWiz dataset.
  • Figure 5: FLOPs and Runtime Comparison between STV and Few-Shot ICL.
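
Stage (2) of the Figure 3 overview, choosing one candidate per location from an activation bank with reinforcement learning, can be sketched as follows. This is a hedged illustration only: the source does not specify the RL algorithm, so the REINFORCE-style update with a moving-average baseline, the function name `select_task_vectors`, the hyperparameters, and the caller-supplied `reward_fn` are all assumptions. In a real setup `reward_fn` would insert the chosen bank entries into the model and return a task score.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_task_vectors(n_locations, bank_size, reward_fn,
                        steps=1000, lr=0.5, seed=0):
    """Learn one categorical policy per location over its bank entries.

    reward_fn(choice) takes a list with one bank index per location and
    returns a scalar reward (hypothetical interface). Returns the index
    with the highest learned logit for each location.
    """
    rng = random.Random(seed)
    logits = [[0.0] * bank_size for _ in range(n_locations)]
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample one candidate per location from the current policy.
        choice = [rng.choices(range(bank_size), weights=p)[0] for p in probs]
        reward = reward_fn(choice)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # REINFORCE: gradient of log-prob of the picked index w.r.t. logits.
        for loc, picked in enumerate(choice):
            for j in range(bank_size):
                grad = (1.0 if j == picked else 0.0) - probs[loc][j]
                logits[loc][j] += lr * advantage * grad
    return [max(range(bank_size), key=row.__getitem__) for row in logits]
```

Each bank entry here is referenced only by its index; the vectors themselves (for example, cluster centroids of collected activations) would live in a separate per-location table and be looked up when the reward is evaluated.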