Towards Global Optimal Visual In-Context Learning Prompt Selection

Chengming Xu, Chen Liu, Yikai Wang, Yuan Yao, Yanwei Fu

TL;DR

This work proposes a novel in-context example selection framework that approximately identifies the globally optimal prompt, i.e., the best-performing in-context example among all alternatives for each query sample, and establishes a new state of the art.

Abstract

Visual In-Context Learning (VICL) is a prevailing way to transfer visual foundation models to new tasks: it leverages the contextual information contained in in-context examples to improve learning and prediction on the query sample. The fundamental problem in VICL is how to select the prompt that best activates this ability, which amounts to a ranking problem: evaluate the in-context behavior of each candidate in the alternative set and select the best one. To use a more appropriate ranking metric and to exploit more comprehensive information across the alternative set, we propose a novel in-context example selection framework that approximately identifies the globally optimal prompt, i.e., the best-performing in-context example among all alternatives for each query sample. Our method, dubbed Partial2Global, adopts a transformer-based list-wise ranker to provide a more comprehensive comparison among several alternatives, and a consistency-aware ranking aggregator to generate a globally consistent ranking. The effectiveness of Partial2Global is validated through experiments on foreground segmentation, single object detection, and image colorization, which show that Partial2Global consistently selects better in-context examples than other methods and thus establishes a new state of the art.
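
To make the idea of turning several partial, list-wise rankings into one global ranking concrete, here is a minimal Python sketch. It is not the paper's consistency-aware aggregator: it substitutes a simple pairwise win-count (Copeland-style) rule, and the function name `aggregate_partial_rankings` and the toy candidate labels are hypothetical, chosen only for illustration. Each partial ranking is assumed to be a short list of candidate identifiers ordered from best to worst, such as a list-wise ranker might produce for a subset of the alternatives.

```python
from collections import defaultdict
from itertools import combinations


def aggregate_partial_rankings(partial_rankings):
    """Merge short ranked lists (best first) into one global order.

    Illustrative stand-in only: candidates are scored by net pairwise wins,
    which is NOT the consistency-aware aggregation described in the paper.
    """
    wins = defaultdict(int)  # wins[(a, b)] = number of lists ranking a above b
    items = set()
    for ranking in partial_rankings:
        items.update(ranking)
        # combinations() preserves list order, so `better` precedes `worse`.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] += 1

    def net_score(item):
        # Pairwise wins minus pairwise losses against every other candidate.
        return sum(
            wins[(item, other)] - wins[(other, item)]
            for other in items
            if other != item
        )

    # Highest net score first; ties are broken by identifier for determinism.
    return sorted(items, key=lambda item: (-net_score(item), item))


# Toy example: three overlapping partial rankings over four candidates.
partial = [["a", "b", "c"], ["b", "c", "d"], ["a", "d"]]
global_order = aggregate_partial_rankings(partial)
print(global_order)
```

In this toy setting the first element of `global_order` would be the candidate chosen as the in-context example for the query. A win-count rule ignores how confident each partial ranking is and does nothing to resolve contradictory lists, which is the kind of inconsistency a consistency-aware aggregator is meant to address.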

Paper Structure

This paper contains 30 sections, 6 equations, 3 figures, 11 tables, 2 algorithms.

Figures (3)

  • Figure 1: Qualitative comparison between our method and VPR, specifically SupPR, in foreground segmentation. In each item we present the image grid in the same order as the input of MAE-VQGAN, i.e. in-context example and its label in the first row, query image and its prediction in the second row. The IoU is listed below each image grid.
  • Figure 2: (a) Scatter plot of visual similarity against IoU for VPR on segmentation. (b) Scatter plot of visual similarity against IoU for our method on segmentation. (c) Visualization of several cases with uncorrelated visual similarity and IoU. The first row presents samples with low similarity but proper in-context performance. The second row presents samples with high similarity but poor in-context performance. Captions below each image grid denote IoU and visual similarity sequentially.
  • Figure 3: Qualitative comparison between our method and VPR, specifically SupPR, in single object detection. For simplicity we present the bounding boxes on images instead of showing the image grids. In each item the left image denotes the in-context example and the right one denotes the query.