TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Yanshu Li, Jianjiang Yang, Tian Yun, Pinyuan Feng, Jinfa Huang, Ruixiang Tang

TL;DR

The paper studies how multimodal ICL in LVLMs depends on the arrangement of in-context demonstrations (ICDs), introducing a task-mapping framework that formalizes the local mapping within each ICD and the global mapping across the ICL sequence. It proposes TACO, a lightweight transformer with a task guider and task-aware attention, which dynamically configures ICL sequences by injecting task-mapping signals into autoregressive decoding. Trained on LVLM-derived data with an Oracle-guided retrieval strategy and evaluated across nine datasets and five LVLMs, TACO achieves consistent gains, particularly on generalized-mapping tasks, while remaining efficient with a small parameter count ($\approx 140$M). The work provides a principled, model-centric approach to optimizing multimodal ICL prompts and shows potential to generalize to NLP and text-to-image tasks. Overall, task mapping emerges as a robust lens for interpreting and enhancing multimodal ICL and reasoning.

Abstract

Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major obstacle is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.

Paper Structure

This paper contains 45 sections, 37 equations, 11 figures, 18 tables.

Figures (11)

  • Figure 1: Examples of 2-shot multimodal ICL. (a) In specific-mapping tasks, the ICDs' local mappings are relatively consistent, and the ICL sequence’s global mapping matches them. Their clarity directly affects the LVLM’s reasoning process. The in-context lens in (c) also reflects this latent reasoning shift induced by task mapping. (b) In generalized-mapping tasks, the LVLM needs to integrate each local mapping into a cohesive global mapping for reasoning. Overreliance on isolated features (e.g., the visual cue of a boat) can break this cohesion.
  • Figure 2: Results on HatefulMemes under various settings. "+" denotes combining two settings.
  • Figure 3: (a-b) Results of different ICL sequence configuration methods on VQAv2 and HatefulMemes. (c-d) Task mapping cohesion analysis of different ICL sequence configuration methods on VQAv2.
  • Figure 4: Our overall pipeline, shown in (b), consists of three parts: a demonstration library, TACO, and a pre-trained LVLM. TACO treats each $(I,Q,R)$ example in the demonstration library as a token. (a) shows TACO training using the LVLM-constructed training data. (c) shows that, given a new query sample, TACO autoregressively retrieves samples from the demonstration library to form a high-quality ICL sequence for LVLM inference.
  • Figure 5: Results of TACO with and without task-aware attention under different $N$-$n$ settings across three datasets, where $N$ is the training sequence shot and $n$ is the generation sequence shot.
  • ...and 6 more figures
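
The autoregressive retrieval step described in the Figure 4(c) caption can be pictured with a small sketch. This is not the paper's model: TACO scores candidates with a trained transformer and task-aware attention, whereas the code below substitutes plain cosine similarity against the running mean of the query and the demonstrations already chosen. The function names (`configure_sequence`, `mean_vector`, `cosine`), the fixed embedding vectors, and the greedy selection rule are all assumptions made for illustration. What the sketch does preserve is the shape of the loop: each demonstration is treated as one token, and each pick is conditioned on the picks before it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def configure_sequence(query_vec, library_vecs, n_shots, score_fn=cosine):
    """Greedily pick n_shots demonstration indices, one per step.

    Hypothetical stand-in for a learned sequence-configuration model:
    score_fn plays the role of the model's next-demonstration score.
    """
    chosen = []
    for _ in range(min(n_shots, len(library_vecs))):
        # Context summarizes the query plus every demonstration picked so far,
        # so each step depends on the earlier choices (autoregressive).
        context = mean_vector([query_vec] + [library_vecs[i] for i in chosen])
        best = max(
            (i for i in range(len(library_vecs)) if i not in chosen),
            key=lambda i: score_fn(context, library_vecs[i]),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy 2-D embeddings; in practice each vector would encode one (I, Q, R) example.
    library = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(configure_sequence([1.0, 0.0], library, n_shots=2))
```

Passing a different `score_fn` is the natural extension point: a learned scorer would replace the cosine heuristic without changing the selection loop.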