Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

TL;DR

This paper tackles the context-length bottleneck in multimodal in-context learning (ICL) by introducing Multimodal Task Vectors (MTV): compact implicit representations, stored in attention heads, that encode many-shot multimodal prompts. MTV computes mean activations over many inference passes and identifies a small set of attention-head locations to patch with them, enabling downstream tasks without finetuning. Across interleaved Large Multimodal Models (LMMs) and vision-and-language (VL) benchmarks, MTV scales with more examples, generalizes to related tasks, and can be combined with explicit few-shot prompts, while offering efficiency advantages over token-space ICL. Finetuning remains a potential upper bound, but MTV preserves zero-shot capabilities and transfer, presenting a practical path to extend multimodal ICL beyond context limits. Overall, MTV is a scalable, finetuning-free way to leverage many-shot multimodal context for VL reasoning.

Abstract

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV) -- compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference. Code: https://github.com/Brandon3964/MultiModal-Task-Vector

Paper Structure

This paper contains 33 sections, 6 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Multimodal Task Vectors (MTV) Overview. We overcome an LMM's context length limitation by encoding many shots of multimodal examples as activations in the LMM's latent space. We then directly substitute this encoding into the LMM's activation space during downstream inference.
  • Figure 2: Multimodal Task Vectors (MTV). In the standard multimodal in-context learning (ICL) paradigm, the number of shots is limited by an LMM's context length. We solve this issue by first finding the mean activations corresponding to the last token of the examples' input (Step 1), and then calculating a set of attention head locations (Step 2) that best align with the downstream task. These mean activations are then replaced directly in these attention head locations (Step 3), enabling many-shot multimodal ICL.
  • Figure 3: Scaling of Qwen-MTV on VizWiz. (Left) The effect of varying the number of shots per iteration for a fixed 100 iterations. (Right) The effect of varying the number of iterations with a fixed 4 shots per iteration.
  • Figure 4: Efficiency. For Flowers, MTV scales with the number of examples, but only up to 100 examples in our experiments.
  • Figure 5: Efficiency. For Flowers, MTV scales with the number of examples, but only up to 100 examples in our experiments.
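
The three steps named in the Figure 2 caption (mean activations, head-location selection, replacement) can be sketched on toy arrays. This is a minimal NumPy illustration and not the authors' implementation. The array layout `(iterations, layers, heads, head_dim)`, the function names `mean_activations`, `select_heads`, and `patch_heads`, and the toy loss are all hypothetical. The captions do not say how the head locations are calculated, so Step 2 here uses a simple stand-in: score each head by the loss obtained when it alone is patched, then keep the `k` lowest-loss heads.

```python
import numpy as np


def mean_activations(per_iteration_acts):
    """Step 1: average last-token head activations over all iterations.

    per_iteration_acts has the assumed shape (T, L, H, d); result is (L, H, d).
    """
    return np.asarray(per_iteration_acts).mean(axis=0)


def patch_heads(acts, mean_acts, locations):
    """Step 3: replace activations at the chosen (layer, head) locations.

    Returns a patched copy; the input array is left untouched.
    """
    patched = np.array(acts, copy=True)
    for layer, head in locations:
        patched[layer, head] = mean_acts[layer, head]
    return patched


def select_heads(zero_shot_acts, mean_acts, loss_fn, k):
    """Step 2 (stand-in, NOT the paper's procedure): keep the k heads whose
    individual patching gives the lowest value of a caller-supplied loss."""
    num_layers, num_heads, _ = mean_acts.shape
    scored = []
    for layer in range(num_layers):
        for head in range(num_heads):
            patched = patch_heads(zero_shot_acts, mean_acts, [(layer, head)])
            scored.append((loss_fn(patched), (layer, head)))
    scored.sort(key=lambda item: item[0])
    return [location for _, location in scored[:k]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for activations recorded over 100 iterations of few-shot prompts.
    recorded = rng.normal(size=(100, 2, 3, 4))
    mtv_means = mean_activations(recorded)

    # Toy zero-shot activations and a toy loss: distance of the summed head
    # outputs from a hypothetical "task" direction.
    zero_shot = np.zeros((2, 3, 4))
    task_direction = mtv_means[1, 2]

    def toy_loss(acts):
        return float(np.linalg.norm(acts.sum(axis=(0, 1)) - task_direction))

    locations = select_heads(zero_shot, mtv_means, toy_loss, k=1)
    patched = patch_heads(zero_shot, mtv_means, locations)
    print("selected locations:", locations)
    print("loss before:", toy_loss(zero_shot), "after:", toy_loss(patched))
```

In a real LMM the recorded activations would come from the model's attention heads rather than a random generator, and the patched values would be written back during the forward pass. The point of the sketch is only that the stored means add no tokens to the downstream prompt.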