Towards Reliable and Holistic Visual In-Context Learning Prompt Selection

Wenxiao Wu, Jing-Hao Xue, Chengming Xu, Chen Liu, Xinwei Sun, Changxin Gao, Nong Sang, Yanwei Fu

TL;DR

The paper tackles the problem of selecting reliable in-context prompts for Visual In-Context Learning (VICL) by challenging the prevalent similarity-priority heuristic and addressing coverage gaps from random sampling. It introduces RH-Partial2Global, which combines a jackknife conformal-prediction-based reliable candidate selection with a covering-design-based holistic sampling to construct robust alternative sets for VICL prompts. Empirical results across segmentation, object detection, and colorization demonstrate systematic gains over Partial2Global, with negative KL divergence as the preferred conformity function and additional improvements when using test-time voting. The work also investigates the universality of the conformal-prediction strategy on VPR variants and discusses limitations related to dataset size and potential data bias, highlighting practical impact for more reliable and holistic VICL prompt selection.

Abstract

Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks by exploiting contextual information embedded in in-context examples. Selecting those examples can be formulated as a global ranking problem over potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such reliance on random sampling can lead to incomplete coverage and redundant sampling of comparisons, further degrading the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
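
To make the coverage issue concrete, here is a minimal sketch; it is not the authors' implementation. It assumes candidates are indexed 0..K-1, that each ranking call sees a block of k candidates, and that "coverage" means every pair of candidates appears together in at least one block (a covering design with t = 2). The greedy construction and the names `pair_coverage`, `random_blocks`, and `greedy_cover` are illustrative assumptions; the paper's covering designs may be built differently.

```python
import itertools
import random


def pair_coverage(blocks, K):
    """Fraction of the K*(K-1)/2 candidate pairs that share at least one block."""
    covered = set()
    for block in blocks:
        covered.update(itertools.combinations(sorted(block), 2))
    return len(covered) / (K * (K - 1) // 2)


def random_blocks(K, k, n_blocks, seed=0):
    """Baseline: draw n_blocks subsets of size k uniformly at random."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(K), k))) for _ in range(n_blocks)]


def greedy_cover(K, k):
    """Greedy heuristic for a (K, k, 2) covering (hypothetical helper).

    Each block is seeded with a still-uncovered pair and then filled with the
    candidates that add the most uncovered pairs. Requires 2 <= k <= K.
    """
    if not 2 <= k <= K:
        raise ValueError("need 2 <= k <= K")
    uncovered = set(itertools.combinations(range(K), 2))
    blocks = []
    while uncovered:
        a, b = min(uncovered)  # guarantees progress on every iteration
        block = [a, b]
        while len(block) < k:
            best = max(
                (c for c in range(K) if c not in block),
                key=lambda c: sum(
                    (min(x, c), max(x, c)) in uncovered for x in block
                ),
            )
            block.append(best)
        block = tuple(sorted(block))
        blocks.append(block)
        uncovered -= set(itertools.combinations(block, 2))
    return blocks


if __name__ == "__main__":
    K, k = 10, 5
    designed = greedy_cover(K, k)
    sampled = random_blocks(K, k, len(designed))
    print("greedy blocks:", len(designed), "coverage:", pair_coverage(designed, K))
    print("random blocks:", len(sampled), "coverage:", pair_coverage(sampled, K))
```

With the same number of blocks, random sampling can leave some pairs never compared, while the greedy blocks cover every pair by construction. The greedy result is not guaranteed to be a minimum-size covering.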

Paper Structure

This paper contains 11 sections, 1 theorem, 8 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

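Considering a $(K,k,t)$ covering design $C(K,k,t)$, the Schönheim lower bound for such a covering design's size is

$$|C(K,k,t)| \ge \left\lceil \frac{K}{k} \left\lceil \frac{K-1}{k-1} \cdots \left\lceil \frac{K-t+1}{k-t+1} \right\rceil \cdots \right\rceil \right\rceil.$$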
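
The Schönheim bound is a nested ceiling expression, so it can be evaluated from the innermost term outward. The sketch below assumes the standard form of the bound; the function name `schonheim_bound` is illustrative and does not come from the paper.

```python
def schonheim_bound(K: int, k: int, t: int) -> int:
    """Schönheim lower bound on the number of blocks in a (K, k, t) covering design.

    Evaluates ceil(K/k * ceil((K-1)/(k-1) * ... * ceil((K-t+1)/(k-t+1)))),
    starting with the innermost ceiling and working outward.
    """
    if not 1 <= t <= k <= K:
        raise ValueError("need 1 <= t <= k <= K")
    bound = 1
    for i in range(t - 1, -1, -1):
        numerator = (K - i) * bound
        denominator = k - i
        bound = -(-numerator // denominator)  # exact integer ceiling
    return bound


if __name__ == "__main__":
    # Example: 10 candidates, blocks of 5, every pair covered (t = 2).
    print(schonheim_bound(10, 5, 2))
```

Integer arithmetic is used throughout to avoid floating-point rounding inside the ceilings. The bound says how many blocks any covering needs at minimum; a given construction may use more.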

Figures (4)

  • Figure 1: Qualitative comparison between our proposed RH-Partial2Global and Partial2Global in the foreground segmentation task. For each comparison item, we display the image grid following the input order of MAE-VQGAN: the first row contains the in-context example alongside its corresponding label, while the second row shows the query image and its predicted result. The IoU value is reported below each image grid to facilitate performance evaluation.
  • Figure 2: Visualization of scatter plot with regression line of similarity and IoU scores.
  • Figure 3: Performance trends of VPR when enhanced by the $\mathcal{S}_{cp}$ and $\mathcal{S}_{fill}$ strategies, evaluated across a range of $\alpha$ parameter values. The setting '$\alpha=1.00$' represents the baseline performance of the original VPR methods without these proposed strategies.
  • Figure 4: Qualitative comparison of single object detection performance between our RH-Partial2Global and Partial2Global. To enhance visual clarity and simplicity, bounding boxes are overlaid directly onto the images, rather than displaying the complete image grids. Each depicted item consists of an in-context example (left) and the corresponding query image (right).

Theorems & Definitions (1)

  • Theorem 1: Schönheim Lower Bound (Schönheim, 1964)