Exploring Task-Level Optimal Prompts for Visual In-Context Learning

Yan Zhu; Huan Ma; Changqing Zhang

TL;DR

This work addresses the high cost of per-sample prompt search in Visual In-Context Learning (VICL) for Vision Foundation Models. It finds that a single task-level prompt achieves near-optimal performance for most samples, and proposes two training-free search strategies, Top-K and Greedy, to identify such prompts efficiently. Experiments on foreground segmentation, object detection, and colorization show state-of-the-art VICL performance with over 98% of the search time saved and results close to an Oracle baseline. Replacing costly sample-specific prompts with a shared task-level prompt, found by a low-overhead search, makes VICL easier to scale and deploy.

Abstract

With the development of Vision Foundation Models (VFMs) in recent years, Visual In-Context Learning (VICL) has become a better choice than modifying models in most scenarios. Unlike retraining or fine-tuning a model, VICL does not require modifications to the model's weights or architecture, and only needs a prompt with demonstrations to teach the VFM how to solve tasks. Currently, the significant computational cost of finding optimal prompts for every test sample hinders the deployment of VICL, as determining which demonstrations to use for constructing prompts is very costly. In this paper, however, we find a counterintuitive phenomenon: most test samples actually achieve optimal performance under the same prompts, so searching for sample-level prompts costs more time yet results in completely identical prompts. Therefore, we propose task-level prompting to reduce the cost of searching for prompts during the inference stage, and we introduce two time-saving yet effective task-level prompt search strategies. Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance at a minimal cost that prior work has never achieved.
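
The abstract names two task-level search strategies but does not spell them out here, so the following is only a minimal sketch of how a Top-K and a Greedy search over demonstrations could look. It assumes a hypothetical `evaluate` callback that returns the mean metric of a demonstration set on a few held-out samples. The names `top_k_prompt`, `greedy_prompt`, `toy_evaluate`, and the `TOY` scores are illustrative and are not taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical callback: mean metric of a demonstration set on held-out samples.
Evaluate = Callable[[tuple], float]


def top_k_prompt(candidates: Sequence[int], evaluate: Evaluate, k: int) -> tuple:
    """Score each demonstration alone, then keep the k best as one shared prompt."""
    ranked = sorted(candidates, key=lambda d: evaluate((d,)), reverse=True)
    return tuple(ranked[:k])


def greedy_prompt(candidates: Sequence[int], evaluate: Evaluate, max_len: int) -> tuple:
    """Grow the shared prompt one demonstration at a time; stop when nothing helps."""
    chosen: tuple = ()
    best = float("-inf")
    while len(chosen) < max_len:
        remaining = [d for d in candidates if d not in chosen]
        if not remaining:
            break
        # Try every remaining demonstration and keep the best extension.
        score, pick = max((evaluate(chosen + (d,)), d) for d in remaining)
        if score <= best:
            break  # adding a demonstration can hurt, so stop early
        chosen, best = chosen + (pick,), score
    return chosen


# Toy data (assumption): TOY[d][i] is the metric of sample i under demonstration d.
TOY = {0: [0.9, 0.8, 0.7], 1: [0.6, 0.9, 0.5], 2: [0.2, 0.1, 0.3]}


def toy_evaluate(demos: tuple) -> float:
    """Stand-in for running the model: average the per-demonstration scores."""
    per_sample = [sum(TOY[d][i] for d in demos) / len(demos) for i in range(3)]
    return sum(per_sample) / len(per_sample)


top_k_choice = top_k_prompt([0, 1, 2], toy_evaluate, k=2)
greedy_choice = greedy_prompt([0, 1, 2], toy_evaluate, max_len=3)
```

In this toy setup the greedy search stops after one demonstration because every extension lowers the averaged score, while Top-K always returns exactly `k` demonstrations. Either result is then reused for every test sample, which is what removes the per-sample search cost.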

Paper Structure (14 sections, 7 equations, 5 figures, 3 tables, 2 algorithms)

Introduction
Related works
  Visual In-Context Learning
  Visual Prompt Selection
Methods
  Problem Setup
  Sample-level Prompt
  Task-level Prompt
  Top-$K$ Prompt Selection Method
  Greedy Prompt Selection Method
Experiments
  Setup
  Results
Conclusion

Figures (5)

Figure 1: (a) A deployment example of VICL in a segmentation task. The one-shot case is presented in the figure; the few-shot prediction is the average of the predictions under multiple different demonstrations. (b) The choice of prompt has a significant impact on task performance. The variance across different prompts is large, and in some cases the metric approaches zero. (c) Comparison of time complexity and performance. Our methods (task-level prompting) significantly reduce complexity while ensuring that performance is no worse than that of more complex methods. (d) Motivation for task-level prompting. We find that during the testing phase, most samples achieve optimal performance under the same prompt. As shown in the figure, more than 27% of the samples achieve their best performance under the same prompt, which means that finding the optimal task-level prompt ensures that at least 27% of the samples obtain the best prompt. In contrast, the sample-level prompt search strategy finds the optimal prompt for only 15.03% of the samples (for details, please refer to the results section).
Figure 2: Performance of all prompts on test set, where "Greedy" and "Oracle" indicate performance under prompts selected by our strategy and the upper-bound performance across all prompts, respectively.
Figure 3: Performance of the demonstration set with a length of 2 and its subsets for each sample, where half of the combinations show a performance drop when adding samples.
Figure 4: In-context results retrieved by several unsupervised prompt selection methods, where "Ours" is the Greedy Prompt Selection Method.
Figure 5: A comparison of the performance of prompts selected by different strategies among all possible prompts (the far right indicates that the selected prompt is the best-performing one among all possible prompts).
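
The statistic behind Figure 1(d), the share of test samples whose best-performing prompt is one and the same prompt, can be computed from a table of per-sample scores. The sketch below assumes a hypothetical layout `scores[i][j]` (metric of sample `i` under candidate prompt `j`) and uses made-up toy numbers, not results from the paper.

```python
from collections import Counter


def best_prompt_share(scores):
    """Return (prompt index, share of samples for which that prompt is the best).

    scores[i][j] is the metric of test sample i under candidate prompt j
    (an assumed layout chosen for illustration).
    """
    # For each sample, find the index of its best-scoring prompt.
    winners = [max(range(len(row)), key=row.__getitem__) for row in scores]
    # The most frequent winner is the natural task-level prompt candidate.
    prompt, count = Counter(winners).most_common(1)[0]
    return prompt, count / len(winners)


# Toy scores (assumption): 4 samples, 3 candidate prompts.
toy_scores = [
    [0.2, 0.9, 0.4],
    [0.1, 0.8, 0.3],
    [0.7, 0.6, 0.2],
    [0.3, 0.5, 0.9],
]

shared_prompt, share = best_prompt_share(toy_scores)
```

With these toy numbers, prompt 1 is the best choice for two of the four samples, so a task-level search that finds it would hand half of the samples their optimal prompt at the cost of a single search.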
