Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation

Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Haiyu Wu, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

TL;DR

This paper tackles forgetting during prompt-based adaptation of vision-language models by introducing an optimal transport (OT)-based regularization that preserves pre-trained multimodal structure. It jointly aligns vision-text embeddings through an OT loss, expanding the feasible space for prompt tuning and capturing cross-instance relationships. The method demonstrates strong improvements in base-to-novel generalization, cross-dataset evaluation, and domain generalization without data augmentation or ensembles, and includes theoretical justification for its benefits. The approach is practical and reproducible, with code publicly available.

Abstract

Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
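To make the contrast between point-wise and OT constraints concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the cosine cost, the Sinkhorn solver with entropic regularization, uniform sample weights, and all function names (`cosine_cost`, `sinkhorn_cost`, `pointwise_cost`) are assumptions chosen for illustration. The features are generic vectors standing in for the zero-shot and adapted representations.

```python
import math


def cosine_cost(xs, ys):
    """Pairwise cost 1 - cos(x, y) between two lists of vectors (assumed cost)."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0  # guard against zero vectors

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn_cost(cost, reg=0.1, n_iters=200):
    """Transport cost <P, C> of an entropy-regularized OT plan.

    Both empirical distributions are given uniform weights (an assumption).
    """
    n, m = len(cost), len(cost[0])
    a, b = 1.0 / n, 1.0 / m
    # Gibbs kernel K_ij = exp(-C_ij / reg)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Plan P_ij = u_i * K_ij * v_j; return the cost it incurs.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


def pointwise_cost(xs, ys):
    """Point-wise baseline: mean cost between the i-th zero-shot and i-th adapted feature."""
    cost = cosine_cost(xs, ys)
    return sum(cost[i][i] for i in range(len(xs))) / len(xs)


# Toy features: the adapted set is a permutation of the zero-shot set.
zero_shot = [[1.0, 0.0], [0.0, 1.0]]
adapted = [[0.0, 1.0], [1.0, 0.0]]

pw = pointwise_cost(zero_shot, adapted)
ot = sinkhorn_cost(cosine_cost(zero_shot, adapted))
```

In this toy case the point-wise penalty is maximal because each feature is compared only with its own counterpart, while the OT cost is close to zero because the two sets have the same overall structure. This is one way to read the claim that OT enlarges the feasible space: the identity matching is always one admissible transport plan, so an exact (unregularized) OT cost never exceeds the point-wise cost computed with the same cost function.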

Paper Structure

This paper contains 22 sections, 2 theorems, 15 equations, 3 figures, 7 tables.

Key Result

lemma 1

Let $\boldsymbol{X}_{\texttt{zs}}$ be a zero-shot representation distribution and $\epsilon > 0$ be a tolerance of the constraint. Suppose there exists a set $\mathcal{X}_{\texttt{pw}}$ such that for all $\boldsymbol{X} \in \mathcal{X}_{\texttt{pw}}$, $\mathcal{L}_{\mathrm{pw}}(\boldsymbol{X}, \bold

Figures (3)

  • Figure 1: (a) Comparison of point-wise constraints vs. our OT-based constraints: Unlike rigid point-wise alignment, our loss captures cross-instance relationships, effectively modeling correlations both within and between classes. (b) The error contours for base and novel tasks without constraint (top left), with point-wise constraint (bottom left), and our OT-based constraints (bottom right). Our OT-based constraint enlarges feasible parameter domains, striking a balance between adaptation and generalization.
  • Figure 2: The effect of $\lambda$ on base-to-novel generalization, averaged over 11 datasets.
  • Figure S3: Overview of the proposed framework. Only the prompt tokens are trainable; all other weights in both the zero-shot encoders and the adapted encoders are frozen. In addition to the cross-entropy loss $\mathcal{L}_{ce}$, the proposed joint optimal transport loss $\mathcal{L}_{\mathrm{jot}}$ between the joint zero-shot representation and the adapted representation is used to constrain the model.
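
Read together, the captions of Figure 2 and Figure S3 suggest a training objective that combines the two losses with a weight $\lambda$. The additive form below is an assumption inferred from those captions, not a formula quoted from the paper:

$$
\mathcal{L} = \mathcal{L}_{ce} + \lambda \, \mathcal{L}_{\mathrm{jot}}
$$

Under this reading, $\lambda = 0$ would reduce to unconstrained prompt tuning, and larger $\lambda$ would pull the adapted representations more strongly toward the zero-shot ones.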

Theorems & Definitions (5)

  • lemma 1
  • proof
  • theorem 1
  • proof
  • remark 1