
PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia

Abstract

Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion, which combines the advantages of multiple demonstrations and offers a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration, and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

Paper Structure (29 sections, 8 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: (a) Condenser performs patch-wise fusion to produce a composite prompt, while leveraging model-agnostic supervision signals at the input level. (b) PromptHub transcends Condenser by enforcing a locality-aware chain that unifies fusion, utilization, and prediction. It aligns spatial priors into coherent prompt representations, reinforces the backbone’s concentration on fused cues, and integrates label prediction to maintain the integrity of the VICL pipeline. (c) Comparison of Condenser and PromptHub across three tasks under both single-prompting and multi-prompting configurations.
  • Figure 2: The training and inference framework of PromptHub based on MAE-VQGAN.
  • Figure 3: PromptHub module design. $N$ prompt pairs and the query image are embedded into the MAE patch space, where locality-enhanced fusion integrates spatial cues into a fused prompt aligned with the query’s informative content.
  • Figure 4: Data augmentation in PromptHub. During training, the top-$N$ pairs are randomly substituted with either query pairs or random pairs.
  • Figure 5: Performance comparison with baselines in multi-prompt VICL scenario.
  • ...and 6 more figures
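
The substitution described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment_prompts`, the per-slot independent sampling, and the probabilities `p_query` and `p_random` are all assumptions made for illustration, since the caption only states that top-$N$ pairs are randomly replaced with query pairs or random pairs.

```python
import random


def augment_prompts(top_n_pairs, query_pair, pool, p_query=0.1, p_random=0.1, rng=None):
    """Randomly substitute retrieved prompt pairs (hypothetical sketch).

    top_n_pairs: list of retrieved (image, label) prompt pairs.
    query_pair:  the (image, label) pair of the query itself.
    pool:        candidate pairs to draw random substitutes from.
    p_query, p_random: assumed substitution probabilities (not from the paper).
    """
    rng = rng or random.Random()
    augmented = []
    for pair in top_n_pairs:
        r = rng.random()  # uniform in [0, 1)
        if r < p_query:
            # Replace this slot with the query pair.
            augmented.append(query_pair)
        elif r < p_query + p_random:
            # Replace this slot with a randomly drawn pair.
            augmented.append(rng.choice(pool))
        else:
            # Keep the retrieved pair unchanged.
            augmented.append(pair)
    return augmented
```

With both probabilities set to zero the retrieved prompts pass through unchanged, so under these assumptions the same function could be reused at inference time by disabling the substitution.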