APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu

TL;DR

This work tackles prompt sensitivity and cross-modal learning challenges in vision-language pretraining by introducing APLe, a token-wise adaptive multi-modal prompt learning framework. APLe decouples and sequentially trains language and vision prompts with an image adapter that uses FFT $\mathcal{F}(\cdot)$ and Gaussian filtering $\mathcal{G}(\cdot)$ to stabilize features; Stage I leverages CLIP zero-shot knowledge with cross-entropy and KL losses against $p_i^{zs}$, while Stage II performs multi-modal token adaptation without zero-shot guidance. Across base-to-novel, cross-dataset, and domain-generalization tasks on 11+ datasets, APLe delivers competitive generalization and shows robustness to prompt-length variations, outperforming or matching strong baselines in several settings. The proposed sequential, decoupled prompting strategy and image-adaptation mechanism offer a practical route to robustly deploying vision-language models in downstream tasks with reduced sensitivity to prompt design and modality imbalance.
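
The Stage I objective described above, cross-entropy plus a KL term against the CLIP zero-shot prediction, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration rather than the authors' implementation: the function names, the weight `lam` (standing in for the hyper-parameter that combines the two losses), and the direction of the KL term, which the summary does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def stage_one_loss(prompt_logits, zero_shot_logits, label, lam=0.5):
    """Hypothetical Stage I loss: CE + lam * KL against zero-shot probabilities.

    Assumption: the KL term is KL(p || p_zs), where p comes from the learned
    prompts and p_zs from frozen zero-shot CLIP. The source does not state
    the direction or the value of lam.
    """
    p = softmax(prompt_logits)
    p_zs = softmax(zero_shot_logits)
    return cross_entropy(p, label) + lam * kl_divergence(p, p_zs)


if __name__ == "__main__":
    # Toy logits for a 3-class problem; the values are made up.
    print(stage_one_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.2], label=0))
```

When the prompted prediction matches the zero-shot prediction the KL term vanishes and only the cross-entropy remains, which is the sense in which the zero-shot knowledge acts as a regularizer in this sketch.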

Abstract

Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks. Existing research has explored many characteristics of V-L models, including their sensitivity to text input and the difficulty of tuning across multi-modal prompts. Building on V-L models such as CLIP, recent approaches replace hand-crafted prompts with learnable prompts to boost generalization performance and address these challenges. Inspired by layer-wise training, which is widely used in image fusion, we note that sequentially training the different modality branches of CLIP efficiently improves generalization. To address the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner. APLe addresses these challenges in V-L models and promotes prompt learning across both modalities, achieving generalization performance competitive with the state of the art. Notably, APLe shows robustness and favourable performance in prompt-length experiments, a clear advantage when adopting V-L models.

Paper Structure

This paper contains 12 sections, 18 equations, 4 figures, 19 tables.

Figures (4)

  • Figure 1: Comparison of frameworks. CoOp adopts uni-modal prompting. MaPLe performs multi-modal prompt learning through a coupling function. APLe proposes independent, sequential multi-modal prompt learning with adaptation.
  • Figure 2: Overview of the APLe (Token-Wise Adaptive for Multi-Modal Prompt Learning) framework for prompting. APLe first deploys an image adapter to mitigate image noise and enhance image features. The prompts are then trained sequentially as tokens, using CLIP zero-shot knowledge and adaptation to alleviate knowledge conflicts and promote synergy between modalities. CE denotes the cross-entropy loss, KL denotes the Kullback-Leibler loss, and $\lambda$ denotes the hyper-parameter used to combine the loss functions.
  • Figure 3: Generalization performance comparison (HM) across various prompt lengths on the EuroSAT and StanfordCars datasets.
  • Figure 4: Adaptation vs. Non-Adaptation Comparison.
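
The image adapter in Figure 2 is described only as combining a Fourier transform with Gaussian filtering to mitigate image noise. As a rough illustration of that general idea, the sketch below applies a Gaussian low-pass mask in the frequency domain. It is not the authors' adapter: it works on a 1-D signal for brevity, uses a naive discrete Fourier transform in place of an FFT routine so that it needs only the standard library, and the function names and `sigma` parameter are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def gaussian_mask(n, sigma):
    """Gaussian weights over frequency bins, centred on the DC component.

    Bin k and bin n - k hold the same absolute frequency, so the mask is
    symmetric and a real input yields a real filtered output.
    """
    return [math.exp(-(min(k, n - k) ** 2) / (2.0 * sigma ** 2))
            for k in range(n)]


def gaussian_lowpass(signal, sigma=2.0):
    """Attenuate high frequencies: transform, mask, transform back."""
    spectrum = dft(signal)
    mask = gaussian_mask(len(signal), sigma)
    filtered = [s * m for s, m in zip(spectrum, mask)]
    return [v.real for v in idft(filtered)]


if __name__ == "__main__":
    # A constant offset plus the highest-frequency alternating pattern.
    noisy = [3.0 + (1.0 if t % 2 == 0 else -1.0) for t in range(8)]
    print(gaussian_lowpass(noisy))
```

In this sketch the DC component passes through unchanged while rapidly alternating components are damped, which is the smoothing behaviour one would expect from a Gaussian filter applied to noisy features.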