What Makes Good Examples for Visual In-Context Learning?

Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu

TL;DR

This work investigates visual in-context learning for large vision models and shows that performance is highly sensitive to the choice of in-context examples. It introduces a prompt retrieval framework, with unsupervised and supervised implementations, that automates example selection by cosine similarity in a feature space; the supervised variant learns that feature space with a contrastive training objective. Empirical results on foreground segmentation, object detection, and image colorization show that supervised prompt retrieval yields the largest gains and that good in-context examples are semantically and spatially close to the query; experiments under distribution shift reveal robustness gaps. The findings offer practical guidance for prompt design in Model-as-a-Service (MaaS) settings and motivate further research into competing cues (spatial vs. semantic) to improve prompt quality in vision tasks.

Abstract

Large-scale models trained on broad data have recently become the mainstream architecture in computer vision due to their strong generalization performance. In this paper, the main focus is on an emergent ability in large vision models, known as in-context learning, which allows inference on unseen tasks by conditioning on in-context examples (a.k.a. prompt) without updating the model parameters. This concept is well known in natural language processing but has only very recently been studied for large vision models. We provide, for the first time, a comprehensive investigation of the impact of in-context examples in computer vision, and find that performance is highly sensitive to the choice of in-context examples. To overcome this problem, we propose a prompt retrieval framework to automate the selection of in-context examples. Specifically, we present (1) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (2) a supervised prompt retrieval method, which trains a neural network to choose examples that directly maximize in-context learning performance. The results demonstrate that our methods bring non-trivial improvements to visual in-context learning in comparison to the commonly used random selection.
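The unsupervised retrieval described in the abstract (nearest example search over features from an off-the-shelf model) can be illustrated with a minimal sketch. It assumes each image has already been encoded into a feature vector; the function names `cosine_similarity` and `retrieve_prompts`, the toy three-dimensional features, and the parameter `k` are hypothetical and chosen only for illustration, not taken from the paper's code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompts(
    query_feat: Sequence[float],
    source_feats: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Return the indices of the k source examples most similar to the query."""
    scores = [cosine_similarity(query_feat, feat) for feat in source_feats]
    ranked = sorted(range(len(source_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy features standing in for the output of an off-the-shelf image encoder (assumption).
source_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_feat = [0.9, 0.1, 0.0]

# Indices of the two source examples to use as in-context examples for this query.
best = retrieve_prompts(query_feat, source_feats, k=2)
```

In this toy setup the query is closest to source example 0, then to example 2, so those two would be chosen as the prompt in place of a random pick.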

Paper Structure (33 sections, 4 equations, 15 figures, 4 tables)


Figures (15)

  • Figure 1: (a) Different choices of in-context examples (outlined in green) often lead to significantly different results. Here we show 30 random query images (x-axis) from Pascal-$5^{i}$ (Shaban et al., 2017) split 0, and measure the performance range using 50 different in-context examples. (b) We propose a prompt retrieval framework aiming to automate the selection of in-context examples. We provide two implementations of the idea: one is unsupervised while the other is supervised, both outperforming random selection by a clear margin.
  • Figure 2: Overview of the supervised prompt retrieval method. The main idea is to compute the in-context learning result for each source example, and pick those with the highest/lowest results to form a positive/negative set for contrastive learning.
  • Figure 3: In-context examples retrieved by UnsupPR and SupPR. In each grid, the first row contains the prompt while the second row contains the query and prediction. The in-context examples found by SupPR are more similar to the queries than those found by UnsupPR in a number of ways: semantics (e.g., (e)), background (e.g., (a)), object pose (e.g., (b)), object appearance (e.g., (i)), viewpoint (e.g., (k)), etc. More examples can be found in the supplementary.
  • Figure 4: (Left) Impact of the size of the retrieval set. (Right) Ablation study on the distance metric used to compute the score function in Eq. \ref{eq:score}. It can be observed that different metrics perform similarly.
  • Figure 5: (Left) Impact of the number of in-context examples. (Right) More in-context examples can lead to better performance. The query in each grid is shown in the bottom right.
  • ...and 10 more figures
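
The supervised procedure summarized in the Figure 2 caption (score every source example by its in-context learning result, then take the highest and lowest as positive and negative sets for contrastive learning) can be sketched as follows. This is a minimal illustration under stated assumptions: the per-example scores are taken as given (for instance an IoU per source example), the set size `m` and the helper names are hypothetical, and the loss shown is a generic InfoNCE-style objective over cosine similarities, which may differ in detail from the one used in the paper.

```python
import math
from typing import List, Sequence, Tuple


def split_pos_neg(icl_scores: Sequence[float], m: int) -> Tuple[List[int], List[int]]:
    """Indices of the m best and m worst source examples by in-context result."""
    if 2 * m > len(icl_scores):
        raise ValueError("need at least 2*m source examples to keep the sets disjoint")
    order = sorted(range(len(icl_scores)), key=lambda i: icl_scores[i], reverse=True)
    return order[:m], order[-m:]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def contrastive_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
) -> float:
    """InfoNCE-style loss (assumed form): low when the query is near the positive."""
    pos = math.exp(cosine(query, positive))
    neg = sum(math.exp(cosine(query, n)) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical in-context results (e.g. IoU) of five source examples for one query.
icl_scores = [0.2, 0.9, 0.5, 0.1, 0.8]
positives, negatives = split_pos_neg(icl_scores, m=2)

# Toy 2-D features: the loss is smaller when the positive is aligned with the query.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss pulls the query's features toward examples that produced good in-context results and away from those that produced poor ones, so that nearest-neighbor retrieval in the learned space favors helpful prompts.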