E-InMeMo: Enhanced Prompting for Visual In-Context Learning

Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara

TL;DR

This work addresses the sensitivity of visual in-context learning (ICL) prompts by introducing E-InMeMo, a parameter-efficient framework that injects a learnable border perturbation into in-context image pairs. A retriever selects a suitable in-context pair, which is refined by a trainable prompt and then processed by a frozen MAE-VQGAN to predict the query label, with only the prompt parameters updated during training. The method achieves state-of-the-art results on foreground segmentation and single-object detection, including robust performance on medical datasets and under domain shifts, while using only 27,540 trainable parameters. These results demonstrate that targeted learnable prompting can substantially improve visual ICL without extensive fine-tuning, offering a practical approach for data-efficient, domain-general vision tasks. The framework holds promise for real-world deployment where prompt quality and domain variability are critical factors.
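The border perturbation can be pictured with a minimal NumPy sketch. It assumes the prompt is a per-pixel offset added only on a frame around each in-context image while the interior is left untouched; the function names, the 112-pixel cell, and the 27-pixel border are hypothetical choices for illustration and are not stated in the text above.

```python
import numpy as np


def border_mask(size, border, channels=3):
    """Boolean mask: True on a frame of width `border`, False in the interior."""
    mask = np.ones((channels, size, size), dtype=bool)
    inner = slice(border, size - border)
    mask[:, inner, inner] = False
    return mask


def apply_border_prompt(image, delta, mask):
    """Add the perturbation on border pixels only and keep values in [0, 1]."""
    return np.clip(image + np.where(mask, delta, 0.0), 0.0, 1.0)


def count_trainable(mask):
    """Only border entries of `delta` would ever receive gradient."""
    return int(mask.sum())


# Hypothetical sizes (assumption, not taken from the paper text above).
mask = border_mask(size=112, border=27)

rng = np.random.default_rng(0)
image = rng.random((3, 112, 112))                 # stand-in for an in-context image
delta = rng.normal(scale=0.1, size=image.shape)   # stand-in for the learned prompt
enhanced = apply_border_prompt(image, delta, mask)

print(count_trainable(mask))  # 27540 for these hypothetical sizes
```

With these assumed sizes the count is 3 × (112² − 58²) = 27,540, which equals the figure quoted above. The text here does not give the actual dimensions, so the match should be read as suggestive rather than confirmed.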

Abstract

Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU by 7.99 points for foreground segmentation and by 17.04 points for single-object detection compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
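For readers unfamiliar with the metric behind those gains, the sketch below shows one common way to compute mIoU for binary foreground masks. It assumes IoU is computed per image and then averaged, reported on a 0 to 100 scale; the paper's exact averaging convention is not stated in this summary, and the helper names are hypothetical.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks empty: treat as a perfect match (a convention, not a rule).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds, targets):
    """Average per-image IoU, scaled to 0-100."""
    return 100.0 * float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))


preds = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [0, 1]])]
targets = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 1]])]
score = mean_iou(preds, targets)  # IoUs are 0.5 and 1.0, so the mean is 75.0
```

On this scale, a gain of 7.99 corresponds to the average overlap rising by roughly eight percentage points.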

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: A schematic comparison between standard visual ICL and E-InMeMo. (a) Visual ICL constructs a four-cell grid canvas composed of a query image, an in-context pair, and an empty cell (bottom-right) where the model generates the prediction. This canvas serves as the prompt, and the output (marked in the red box) is produced by passing it through a frozen large-scale vision model. (b) E-InMeMo enhances this paradigm by introducing a learnable prompt, a trainable perturbation designed to adjust the distribution of the in-context pair, thereby improving task guidance and prediction accuracy.
  • Figure 2: Performance comparison of visual ICL on a foreground segmentation task. Blue boxes indicate in-context pairs, while red boxes represent the predicted label images (query images are unmarked). The quality and similarity of the in-context pair significantly influence the prediction results. In the absence of a learnable prompt, performance is highly dependent on the semantic closeness between the query and in-context images. In contrast, E-InMeMo, with its learnable prompt, produces more stable and accurate predictions across varying input conditions.
  • Figure 3: Overview of the proposed E-InMeMo framework. The process begins with the In-context Learning Retriever, which selects an in-context pair from the dataset $\mathcal{S}$ for a given query image. A Prompt Enhancer $t_\phi(\cdot)$ then applies learnable perturbations to the in-context pair, producing an enhanced version. These enhanced in-context images, along with the query and an empty cell (bottom-right), are arranged into a four-cell grid canvas. This canvas is passed through a frozen MAE, which outputs predicted visual tokens corresponding to the empty cell. For visualization, the predicted tokens are decoded into an image using the decoder of VQGAN. During training, a ground-truth canvas, comprising the original in-context pair and the true label for the query, is encoded using a pre-trained VQGAN encoder to produce ground-truth tokens. A cross-entropy loss is computed on the empty cell to update only the parameters of the Prompt Enhancer, $\phi$.
  • Figure 4: Qualitative comparisons across baseline methods, prompt-SelF, and our proposed E-InMeMo on two downstream tasks: (a) Foreground segmentation and (b) Single-object detection. For each task, the top row shows the query image. The subsequent rows (from top to bottom) present results from FMLR (DINOv2), prompt-SelF, E-InMeMo, and the ground-truth label (GT), respectively. E-InMeMo consistently enables visual ICL to capture finer details and demonstrates robustness to mismatches between in-context and query images. Notably, it also appears to mitigate the negative impact of low-quality in-context pairs—an important advantage when the retriever fails to find highly similar samples.
  • Figure 5: Qualitative results of FMLR (DINOv2) and our proposed E-InMeMo on two medical datasets: (a) Kvasir and (b) ISIC. For each dataset, the top row shows the query image. The following rows are arranged from top to bottom in the order of FMLR (DINOv2), E-InMeMo, and the ground-truth label (GT), respectively.
  • ...and 3 more figures
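
The canvas layout and training signal described in the captions of Figures 1 and 3 can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the placement of the in-context pair in the top row and the query in the bottom-left is an assumption (the captions only fix the empty cell at the bottom-right), and the token grid size and vocabulary size used in the demo are hypothetical.

```python
import numpy as np


def build_canvas(ctx_image, ctx_label, query, fill=0.0):
    """Arrange three (C, H, W) images into a 2x2 grid; bottom-right stays empty."""
    c, h, w = query.shape
    canvas = np.full((c, 2 * h, 2 * w), fill, dtype=float)
    canvas[:, :h, :w] = ctx_image   # top-left: in-context image (assumed position)
    canvas[:, :h, w:] = ctx_label   # top-right: in-context label (assumed position)
    canvas[:, h:, :w] = query       # bottom-left: query image (assumed position)
    return canvas


def empty_cell_token_mask(grid):
    """Flattened mask selecting tokens that fall in the bottom-right quadrant."""
    mask = np.zeros((grid, grid), dtype=bool)
    mask[grid // 2:, grid // 2:] = True
    return mask.reshape(-1)


def masked_cross_entropy(logits, target_tokens, token_mask):
    """Cross-entropy over visual tokens, averaged over the empty cell only."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(target_tokens)), target_tokens]
    return float(nll[token_mask].mean())


# Tiny demo with hypothetical sizes: 4x4 images, a 4x4 token grid, 8 codebook entries.
rng = np.random.default_rng(0)
ctx_image, ctx_label, query = (rng.random((3, 4, 4)) for _ in range(3))
canvas = build_canvas(ctx_image, ctx_label, query)

token_mask = empty_cell_token_mask(grid=4)
logits = np.zeros((16, 8))                     # stand-in for frozen-model outputs
target_tokens = rng.integers(0, 8, size=16)    # stand-in for VQGAN-encoded ground truth
loss = masked_cross_entropy(logits, target_tokens, token_mask)
```

In the setup the captions describe, the gradient of this loss would flow back through the frozen model to the Prompt Enhancer parameters $\phi$ only; the sketch stops at the loss value and does not model that backward pass.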