APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning
Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu
TL;DR
This work tackles prompt sensitivity and cross-modal learning challenges in vision-language pretraining by introducing APLe, a token-wise adaptive multi-modal prompt learning framework. APLe decouples and sequentially trains language and vision prompts with an image adapter that uses FFT $\mathcal{F}(\cdot)$ and Gaussian filtering $\mathcal{G}(\cdot)$ to stabilize features; Stage I leverages CLIP zero-shot knowledge with cross-entropy and KL losses against $p_i^{zs}$, while Stage II performs multi-modal token adaptation without zero-shot guidance. Across base-to-novel, cross-dataset, and domain-generalization tasks on 11+ datasets, APLe delivers competitive generalization and shows robustness to prompt-length variations, outperforming or matching strong baselines in several settings. The proposed sequential, decoupled prompting strategy and image-adaptation mechanism offer a practical route to robustly deploying vision-language models in downstream tasks with reduced sensitivity to prompt design and modality imbalance.
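As a rough illustration of the Stage I objective summarized above, the sketch below combines a cross-entropy term on the prompted prediction with a KL term against the zero-shot distribution $p_i^{zs}$. The summary does not give the exact form of the loss, so the KL direction (zero-shot to prompted), the weight `kl_weight`, and all function names here are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def stage_one_loss(prompt_logits, zero_shot_probs, label, kl_weight=1.0):
    """Hypothetical Stage I loss: CE on the prompted prediction plus a
    weighted KL term that keeps it close to the zero-shot prediction.

    Assumption: the KL term is KL(p_zs || p_prompt); the paper summary
    only states that a KL loss against p_zs is used.
    """
    p_prompt = softmax(prompt_logits)
    ce = cross_entropy(p_prompt, label)
    kl = kl_divergence(zero_shot_probs, p_prompt)
    return ce + kl_weight * kl
```

Under these assumptions, the KL term vanishes when the prompted prediction equals the zero-shot prediction, so the loss reduces to plain cross-entropy. Stage II, which the summary describes as running without zero-shot guidance, would correspond to dropping the KL term (for example, `kl_weight=0.0` in this sketch).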
Abstract
Among notable contenders, pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks. Existing research has explored many characteristics of V-L models, including their sensitivity to text input and the difficulty of tuning across multi-modal prompts. With the widespread adoption of V-L models such as CLIP, recent approaches deploy learnable prompts instead of hand-crafted prompts to boost generalization performance and address these challenges. Inspired by layer-wise training, which is widely used in image fusion, we note that adapting the different modality branches of CLIP through a sequential training process efficiently improves generalization. To address the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, achieving generalization performance competitive with the state of the art. Notably, APLe shows robustness and favourable performance in prompt-length experiments, giving it a clear advantage in adopting V-L models.
