Table of Contents

Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning

Tianci Luo, Haohao Pan, Jinpeng Wang, Niu Lian, Xinrui Chen, Bin Chen, Shu-Tao Xia, Chun Yuan

Abstract

Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance and stands out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal that these approaches sometimes retrieve visually similar but label-inconsistent prompts, which can degrade VICL performance. Conversely, higher label consistency between query and prompts typically indicates stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image-label joint representation for prompts to incorporate label cues explicitly. In addition, since query labels are unavailable at test time, we introduce a mixture-of-experts mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps learn label-aware representations. We carefully design an alternating optimization for experts and router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvements of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, underscoring the importance of label utilization in prompt retrieval for VICL. Code is available at https://github.com/luotc-why/CVPR26-LaPR.
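The routing idea in the abstract can be illustrated with a minimal numpy sketch. All dimensions, the number of experts, and the function names (`joint_prompt_embed`, `label_aware_embed`) are hypothetical choices for illustration, not details taken from the paper; the experts are toy linear maps standing in for learned encoders.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: embedding dim D, number of mode-specific experts K.
D, K = 8, 4
rng = np.random.default_rng(0)

# Toy linear experts, each intended to capture one label mode.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(K)]
router_w = rng.standard_normal((D, K)) / np.sqrt(D)

def joint_prompt_embed(img_feat, label_feat, proj):
    """Image-label joint representation: fuse label cues into the prompt embedding."""
    return np.concatenate([img_feat, label_feat]) @ proj

def label_aware_embed(x):
    """Query-adaptive routing: mix mode-specific expert outputs by router weights."""
    gate = softmax(x @ router_w)               # (K,) mixture weights for this query
    outs = np.stack([x @ E for E in experts])  # (K, D) mode-specific features
    return gate @ outs, gate                   # weighted mixture, plus the gate
```

Because queries lack labels at test time, the router infers the mixture weights from the query embedding alone, which is the mechanism the abstract describes.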

Paper Structure

This paper contains 34 sections, 16 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: We randomly sample 100 query-prompt training pairs constructed by SupPR. We compute the label matching consistency and VICL performance to investigate their correlation.
  • Figure 2: Prompt retrieval paradigms. (a) Label-agnostic pipelines rely on image similarity and may yield label-inconsistent prompts and disturb inference. (b) LaPR considers both image similarity and label consistency for prompt retrieval, retrieving more relevant prompts.
  • Figure 3: Overview of LaPR. (a) LaPR architecture. Prompt labels are injected to form joint embeddings. On both sides, experts produce mode-specific features and a query-conditioned router picks the matching mode and extracts its information, resulting in label-aware query embeddings and query-relevant prompt embeddings. (b) Training framework. Each mini-batch alternates an expert step (performance-guided contrastive learning, router fixed) and a router step (label-guided contrastive learning with load balancing, experts frozen).
  • Figure 4: Qualitative visualization comparing LaPR (label-aware prompt retrieval) with SupPR (label-agnostic prompt retrieval). LaPR consistently retrieves label-compatible prompts and yields visibly more accurate VICL predictions.
  • Figure 5: Proportion of each mode-specific expert selected under each category, visualized as a heatmap.