E-InMeMo: Enhanced Prompting for Visual In-Context Learning
Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
TL;DR
This work addresses the sensitivity of visual in-context learning (ICL) to prompt quality by introducing E-InMeMo, a parameter-efficient framework that adds a learnable border perturbation to in-context image pairs. A retriever selects a suitable in-context pair, a trainable prompt refines it, and a frozen MAE-VQGAN then processes the refined pair together with the query image to predict the query label; only the prompt parameters are updated during training. The method achieves state-of-the-art results on foreground segmentation and single-object detection, including robust performance on medical datasets and under domain shifts, while using only 27,540 trainable parameters. These results indicate that targeted learnable prompting can substantially improve visual ICL without extensive fine-tuning, offering a practical approach for data-efficient vision tasks in settings where prompt quality and domain variability are critical.
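To make the border perturbation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical 112x112 in-context image with a 27-pixel border frame; these sizes are not stated in the text and are chosen only because they yield exactly 27,540 values. The class name `BorderPrompt`, the additive update, and the sharing of one prompt between the in-context image and its label are likewise assumptions made for illustration. The retriever, the frozen MAE-VQGAN, and the training loop are omitted.

```python
import numpy as np

# Hypothetical sizes (assumptions, not taken from the source text).
IMAGE_SIZE = 112  # side length of each in-context image
BORDER = 27       # width of the perturbed border frame
CHANNELS = 3      # RGB


def border_mask(size: int, border: int) -> np.ndarray:
    """Return a (size, size) boolean mask that is True only on the border frame."""
    mask = np.ones((size, size), dtype=bool)
    mask[border:size - border, border:size - border] = False
    return mask


class BorderPrompt:
    """Illustrative border perturbation: one value per border pixel and channel."""

    def __init__(self, size=IMAGE_SIZE, border=BORDER, channels=CHANNELS, seed=0):
        self.mask = border_mask(size, border)
        rng = np.random.default_rng(seed)
        # These values stand in for the trainable prompt parameters.
        self.values = rng.normal(0.0, 0.01, size=(int(self.mask.sum()), channels))

    @property
    def num_parameters(self) -> int:
        return self.values.size

    def apply(self, image: np.ndarray) -> np.ndarray:
        """Add the perturbation to the border of a (size, size, channels) image in [0, 1]."""
        out = image.astype(float)  # astype returns a copy, so the input is untouched
        out[self.mask] = out[self.mask] + self.values
        return np.clip(out, 0.0, 1.0)


prompt = BorderPrompt()
pair_image = np.full((IMAGE_SIZE, IMAGE_SIZE, CHANNELS), 0.5)  # placeholder in-context image
pair_label = np.zeros((IMAGE_SIZE, IMAGE_SIZE, CHANNELS))      # placeholder in-context label

# Assumption: the same prompt is applied to both halves of the in-context pair.
prompted_image = prompt.apply(pair_image)
prompted_label = prompt.apply(pair_label)

print(prompt.num_parameters)  # 27540 = 3 * (112**2 - 58**2) under the assumed sizes
```

In this sketch the interior of each image is left unchanged, so the perturbation alters only the frame around the in-context content. In a training setup, only `values` would receive gradient updates while the downstream model stays frozen.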
Abstract
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where a model receives a query image together with an input-output image pair, known as an in-context pair, that illustrates the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. In extensive experiments on standard vision tasks, E-InMeMo outperforms existing state-of-the-art methods. Notably, it improves mIoU by 7.99 for foreground segmentation and by 17.04 for single-object detection compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
