Learning to Select Visual In-Context Demonstrations

Eugene Lee, Yu-Chi Lin, Jiajie Diao

Abstract

Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.

Figures (12)

  • Figure 1: An overview of our LSD (Learning to Select Demonstrations) framework. The process is a training loop where the MLLM acts as the Environment. (1) The Agent (a Dueling DQN) receives the current state $s_t$, which contains the query embedding $\mathbf{e}_q$ and the embeddings of all previously selected demonstrations $\{\mathbf{e}_1, \dots, \mathbf{e}_{t-1}\}$. (2) The agent's query-centric decoder outputs an advantage query $\mathbf{a}_s$, which is used to retrieve candidates $A_{\text{cand}}$ from the Task's Data via FAISS. (3) The agent selects the next best demonstration, $d_t$. (4) The full prompt (including the selected demos $d_1 \dots d_K$ and the query) is sent to the MLLM (Environment), which makes a prediction. (5) A Reward $r_t$ is calculated based on the prediction's accuracy (e.g., MAE). (6) This reward is used to update the agent's policy. (A minimal code sketch of this selection loop is given after the figure list.)
  • Figure 2: Performance vs. Number of Shots ($K$) on four datasets. We plot the MAE as $K$ increases. The results are task-dependent: (a), (c), (d) Objective Tasks (UTKFace, KonIQ, KADID): Our LSD policy (blue) consistently outperforms the kNN baseline (orange). (b) Subjective Task (AVA): The kNN baseline, which is based on visual similarity, consistently outperforms LSD.
  • Figure 3: Demonstration Set Analysis on UTKFace, plotted against $K$ shots. (a) MAE of Demo Labels vs. Query: The MAE between selected demo labels and the query's true label. LSD finds demos with closer labels. (b) Pairwise Label MAE: The MAE computed over all pairwise label differences among the selected demos. (c) Demo-Query Feature Similarity: The cosine similarity between demo embeddings and the query embedding. LSD balances similarity with other factors. (d) Pairwise Feature Similarity: The cosine similarity between every pair of selected demonstrations. LSD actively seeks diverse (low-similarity) demos. (These four diagnostics are sketched in code after the figure list.)
  • Figure 4: Qualitative Comparison of Selected Demonstrations ($K=12$). (a) UTKFace: For an 8-year-old query, kNN selects only images with highly similar features (e.g., other young children). LSD selects a diverse spectrum of visual features (e.g., varied ages, genders, and lighting conditions) to build a richer context. (b) KADID-10k: For a motion-blurred query, kNN selects only other distorted versions of the same source image. LSD selects a varied set, including the pristine original and images with different distortion types from different source images, defining the quality boundaries.
  • Figure 5: Cross-MLLM Generalization (MAE $\downarrow$) on UTKFace vs. Number of Shots ($K$). We use the single LSD policy (trained on Gemma 3 4B-it) to select demos for two unseen MLLMs. The plots show our policy (blue line) versus the kNN (orange line) and Random (green line) baselines. (a) On Qwen 2.5 7B, our policy consistently outperforms kNN. (b) On Phi-3.5-vision, our policy performs on par with kNN. Both LSD and kNN significantly outperform the Random baseline.
  • ...and 7 more figures
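
The selection loop described in Figure 1 is procedural enough to sketch in code. The fragment below is a minimal, illustrative rollout, not the paper's implementation: the module names, hyperparameters, and the dot-product advantage scoring (standing in for the FAISS inner-product retrieval of $A_{\text{cand}}$) are all assumptions, and random vectors stand in for real image embeddings and the MLLM environment.

import torch
import torch.nn as nn

EMB, K, N_CAND = 512, 4, 1000  # embedding dim, shots, candidate pool (illustrative)

class QueryCentricDQN(nn.Module):
    """Dueling DQN head on a query-centric Transformer decoder (hypothetical layout)."""
    def __init__(self, d=EMB, heads=8, layers=2):
        super().__init__()
        dec_layer = nn.TransformerDecoderLayer(d, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.value = nn.Linear(d, 1)     # state value V(s)
        self.adv_proj = nn.Linear(d, d)  # emits the advantage query a_s

    def forward(self, query_emb, selected_embs):
        # Query-centric decoding: the query attends to the demos chosen so far.
        ctx = self.decoder(tgt=query_emb.unsqueeze(1), memory=selected_embs)
        h = ctx.squeeze(1)
        return self.value(h), self.adv_proj(h)  # V(s), a_s

def q_values(value, adv_query, cand_embs):
    # Dueling aggregation Q(s,a) = V(s) + A(s,a) - mean_a A(s,a); here A(s,a)
    # is the inner product of a_s with each candidate embedding (an assumption
    # consistent with inner-product retrieval, not taken from the paper).
    adv = cand_embs @ adv_query.squeeze(0)  # (N_CAND,)
    return value.squeeze() + adv - adv.mean()

torch.manual_seed(0)
agent = QueryCentricDQN()
query = torch.randn(1, EMB)  # stands in for the query image embedding e_q
pool = torch.nn.functional.normalize(torch.randn(N_CAND, EMB), dim=-1)
selected = torch.zeros(1, 1, EMB)  # placeholder "empty context" token
chosen = []
with torch.no_grad():
    for t in range(K):
        v, a_s = agent(query, selected)
        # The paper retrieves A_cand via FAISS using a_s; scoring the whole toy
        # pool directly is equivalent for a pool this small.
        q = q_values(v, a_s, pool)
        if chosen:
            q[torch.tensor(chosen)] = float("-inf")  # mask already-picked demos
        a_t = int(q.argmax())
        chosen.append(a_t)
        selected = torch.cat([selected, pool[a_t].view(1, 1, EMB)], dim=1)
print("selected demo indices:", chosen)
# The MLLM (environment) would now be prompted with these K demos plus the
# query; the negative MAE of its answer is the reward r_t used to update the
# Q-network with a standard TD target.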
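
Figure 3's four diagnostics are likewise easy to make concrete. The helper below follows the panel captions literally; the function name and averaging conventions are assumptions, and the paper's exact definitions may differ in detail.

import numpy as np

def demo_set_diagnostics(demo_embs, demo_labels, query_emb, query_label):
    """demo_embs: (K, D), demo_labels: (K,), query_emb: (D,), query_label: scalar."""
    demo_embs = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    iu = np.triu_indices(len(demo_labels), k=1)  # unordered demo pairs
    # (a) MAE between the selected demos' labels and the query's true label
    label_mae = np.abs(demo_labels - query_label).mean()
    # (b) MAE over all pairwise label differences among the demos
    pairwise_label_mae = np.abs(demo_labels[:, None] - demo_labels[None, :])[iu].mean()
    # (c) mean cosine similarity between each demo and the query
    demo_query_sim = (demo_embs @ query_emb).mean()
    # (d) mean pairwise cosine similarity among the demos (lower = more diverse)
    pairwise_sim = (demo_embs @ demo_embs.T)[iu].mean()
    return label_mae, pairwise_label_mae, demo_query_sim, pairwise_sim

# Toy usage: score a K=12 demo set against a query with true label 42.
rng = np.random.default_rng(0)
embs, labels = rng.normal(size=(12, 512)), rng.uniform(0, 100, size=12)
print(demo_set_diagnostics(embs, labels, rng.normal(size=512), 42.0))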