Parallel In-context Learning for Large Vision Language Models

Shin'ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa

Abstract

Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, it incurs significant inference latency due to the quadratic computational cost of Transformer attention with respect to context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
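
The logit-level weighted PoE described in the abstract can be illustrated with a minimal sketch. It assumes each chunk has already produced one next-token logit vector and that the chunk weights are given; the function names `log_softmax` and `weighted_poe` are hypothetical and not taken from the paper. A weighted PoE multiplies the per-chunk distributions raised to their weights, which in log space is a weighted sum followed by renormalization.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def weighted_poe(chunk_logits, weights):
    """Combine K per-chunk logit vectors into one distribution.

    chunk_logits: list of K lists, each of vocabulary size V (assumed input).
    weights: list of K non-negative chunk weights (normalized here to sum to 1).
    Returns a list of V probabilities proportional to prod_k p_k(y) ** w_k.
    """
    total = sum(weights)
    norm_w = [w / total for w in weights]
    log_probs = [log_softmax(row) for row in chunk_logits]
    vocab = len(chunk_logits[0])
    # Weighted sum of log-probabilities = log of the weighted product.
    combined = [
        sum(w * lp[v] for w, lp in zip(norm_w, log_probs))
        for v in range(vocab)
    ]
    # Renormalize so the result is a proper distribution.
    return [math.exp(x) for x in log_softmax(combined)]

# Toy usage: two chunks, three-token vocabulary.
example = weighted_poe([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]], [0.75, 0.25])
```

Because the weights are normalized to sum to one, summing raw logits instead of log-probabilities changes the result only by a constant that the final normalization removes, so the two variants give the same distribution.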

Paper Structure (17 sections, 1 theorem, 11 equations, 7 figures, 4 tables, 1 algorithm)

Key Result

Theorem 5.1

Given a ground truth $y$, outputs from $K$ models $\mathbf{o}=\{o_1,\dots,o_K\}$, and a reconstruction function $f:\mathbf{o}\mapsto\hat{y}$ (e.g., PoE), the error rate $p_\mathrm{err} = \mathrm{Pr}[y\neq f(\mathbf{o})]$ is bounded by where $H$ is entropy and $\mathcal{I}(\mathbf{o}, y)$ is defined as follows: $I(o;y)$ is mutual information between $o$ and $y$, and $I_\mathrm{multi}(\cdot)$ is c
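
For orientation, the bounds in the two works this theorem is attributed to (Brown 2009; Zhou 2010) have a well-known form, sketched below. Whether Theorem 5.1 uses exactly these constants and this notation is an assumption here, and $|\mathcal{Y}|$ (the number of possible labels) is a symbol introduced only for this sketch. Entropies are measured in bits.

```latex
% Assumed standard form from the cited ensemble-theory literature,
% not a verbatim copy of Theorem 5.1.
\frac{H(y) - \mathcal{I}(\mathbf{o}, y) - 1}{\log_2 |\mathcal{Y}|}
\;\le\; p_\mathrm{err} \;\le\;
\frac{H(y) - \mathcal{I}(\mathbf{o}, y)}{2},
\qquad
\mathcal{I}(\mathbf{o}, y)
= \sum_{k=1}^{K} I(o_k; y)
+ I_\mathrm{multi}(\mathbf{o} \mid y)
- I_\mathrm{multi}(\mathbf{o})
```

Under this reading, the sum of $I(o_k; y)$ terms measures how relevant each model's output is to the ground truth, while $I_\mathrm{multi}(\mathbf{o})$ measures redundancy among the outputs. Raising relevance and lowering redundancy both increase $\mathcal{I}(\mathbf{o}, y)$ and so tighten the bounds, which is consistent with the abstract's two strategies of query-relevance weighting and diversity-maximizing chunking.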

Figures (7)

  • Figure 1: Parallel-ICL. Instead of using a full demonstration context, we propose partitioning it into smaller chunk contexts (context chunking) and then integrating the logits from the chunked contexts to produce the output (context compilation) for efficient inference in multi-modal in-context learning (MM-ICL) by large vision-language models (LVLMs). Parallel-ICL improves inference speed while maintaining performance competitive with the original MM-ICL.
  • Figure 2: Pipeline of Parallel-ICL. We cluster demonstrations in a multi-modal feature space and use each cluster as a chunk (context chunking). Then, we process the chunk-wise contexts with LVLMs and weight their outputs (logits) based on query-chunk similarity, composing an ensemble for the final prediction as a PoE (context compilation). This can be computed as a weighted sum of the outputs at the logit level.
  • Figure 3: LLaVA-OV-7B
  • Figure 4: Qwen2.5-VL-7B
  • Figure 5: InternVL3.5-8B
  • ...and 2 more figures
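
The two pipeline stages in the Figure 2 caption, clustering demonstrations into chunks and weighting chunks by similarity to the query, can be sketched as follows. This is a toy illustration under several assumptions that are not confirmed by the text above: plain k-means with a naive first-`k` initialization stands in for the clustering step, cosine similarity to the cluster centroid followed by a softmax stands in for the query-chunk weighting, and the names `kmeans_chunks`, `chunk_weights`, and `temperature` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans_chunks(features, k, iters=20):
    """Group demonstration indices into k chunks with plain k-means.

    features: one feature vector per demonstration (assumed to come from
    some multi-modal encoder). Returns (chunks, centroids), where chunks[c]
    lists the demonstration indices assigned to cluster c.
    """
    centroids = [list(features[i]) for i in range(k)]  # naive init (assumption)
    assign = [0] * len(features)
    for _ in range(iters):
        # Assign each demonstration to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: math.dist(f, centroids[c]))
            for f in features
        ]
        # Recompute centroids; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    chunks = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    return chunks, centroids

def chunk_weights(query, centroids, temperature=1.0):
    """Softmax over query-centroid cosine similarities (assumed weighting)."""
    sims = [cosine(query, c) / temperature for c in centroids]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

# Toy usage: four demonstrations in a 2-D feature space, two chunks.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks, cents = kmeans_chunks(feats, k=2)
weights = chunk_weights([1.0, 0.0], cents)
```

The resulting weights would then scale each chunk's logits in the final weighted sum, so chunks whose demonstrations resemble the query contribute more to the prediction.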

Theorems & Definitions (1)

  • Theorem 5.1: Brown_2009_information_theory_ensemble and Zhou_2010_multi-information_ensemble