MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models

Haoyang Li, Siyu Zhou, Liang Wang, Guodong Long

TL;DR

MAO addresses the added complexity and training cost of existing prompt-tuning methods for Vision-Language Models (VLMs) with a model-agnostic optimization framework that leaves the prompt-tuning backbone unchanged. It augments prompt tuning with Data-Driven Enhancement (hard negative sampling for base tasks, rapid pseudo-labeling for new tasks) and Alterable Regularization (dynamic cross-entropy over targeted candidate sets) to improve the data distribution and the feature processing. The two-step, plug-and-play workflow improves base-to-new generalization and cross-dataset transfer while keeping computational cost comparable to that of standard prompt tuners. The method is evaluated on 11 datasets and is compatible with multiple backbones.
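
The hard negative sampling mentioned above can be illustrated with a minimal sketch. It assumes that "hard negatives" for a class are the K other classes whose embeddings are most similar to it under cosine similarity. This summary does not specify the paper's actual sampler, so the function names, the similarity measure, and the toy 2-D embeddings below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_hard_negatives(class_embeddings, k):
    """For each class, return the k other classes with the highest similarity.

    class_embeddings: dict mapping class name -> embedding (list of floats).
    Assumption: similarity between class embeddings is a proxy for how
    easily two classes are confused.
    """
    neighbours = {}
    for name, emb in class_embeddings.items():
        scored = [
            (cosine(emb, other_emb), other)
            for other, other_emb in class_embeddings.items()
            if other != name
        ]
        # Most similar classes first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbours[name] = [other for _, other in scored[:k]]
    return neighbours


# Toy 2-D embeddings (hypothetical values, not taken from CLIP).
toy = {
    "leopard": [1.0, 0.1],
    "cheetah": [0.9, 0.2],
    "airplane": [0.0, 1.0],
    "helicopter": [0.1, 0.9],
}
hard = top_k_hard_negatives(toy, k=1)
```

With these toy values, each animal class is paired with the other animal and each aircraft class with the other aircraft, so a mini-batch built from such neighbours would contain classes that are harder to tell apart than a random draw.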

Abstract

Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., additional loss calculation and meta-networks. These approaches generally lead to increased complexity and extended training cost. To maintain the efficiency of the tuning process, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any components of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments on MAO demonstrate its outstanding performance and efficiency. The code of MAO is available at https://github.com/JREion/M.A.O.

Paper Structure

This paper contains 21 sections, 10 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Architecture comparison between (a) existing prompt tuning backbones and (b) our Model-Agnostic Optimization (MAO) framework, which introduces the Data-Driven Enhancement and Alterable Regularization modules.
  • Figure 2: Framework of the proposed MAO. MAO builds a two-step fine-tuning structure without altering any component of the prompt tuning backbone. In (a) base tasks, MAO introduces a hard negative sampler as Data-Driven Enhancement (DDE) and an Alterable Regularization (reg-B) that guides the model in learning the feature distribution of hard negatives while preserving generalization. Then, in (b) new tasks, rapid pseudo-labeling is performed on unlabeled images as DDE using a shared-parameter CLIP, followed by reg-N to constrain the fine-tuning on new classes. The inference process follows the settings of the original backbones.
  • Figure 3: Average harmonic-mean (HM) accuracy on base-to-new generalization tasks for 3 backbones combined with the plug-and-play methods DePT (Zhang et al., 2024) and our MAO.
  • Figure 4: The impact of (Left) the Top-$K$ value in Data-Driven Enhancement for base-class tasks and (Right) the number of shots of unlabeled images for new-class tasks on the accuracy and computational cost of CoOp-based MAO.
  • Figure 5: Visual representation of semantic distance within a mini-batch sampled by (a) a random sampling strategy and (b) MAO's hard negative sampler on the Caltech101 dataset. A smaller distance indicates stronger similarity.
  • ...and 1 more figures
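
The new-task step described in the Figure 2 caption (pseudo-labeling followed by a regularized loss) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that a pseudo-label is the arg-max class of zero-shot logits, and that the regularization restricts the softmax of a cross-entropy loss to a chosen candidate subset of classes. The logits are hard-coded stand-ins for CLIP outputs, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(image_logits):
    """Return (index of the most likely class, its softmax probability)."""
    probs = softmax(image_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def candidate_cross_entropy(logits, target, candidates):
    """Cross-entropy whose softmax runs only over the classes in `candidates`.

    Assumption: the candidate set is supplied by the caller; how it is
    chosen in MAO is not described in this summary.
    """
    if target not in candidates:
        raise ValueError("target must be one of the candidates")
    sub_logits = [logits[c] for c in candidates]
    probs = softmax(sub_logits)
    return -math.log(probs[candidates.index(target)])


# Hypothetical zero-shot logits for one unlabeled image over 4 classes.
logits = [2.0, 1.0, 0.1, -1.0]
label, confidence = pseudo_label(logits)

# Loss over all classes versus loss over a 2-class candidate set.
loss_full = candidate_cross_entropy(logits, label, [0, 1, 2, 3])
loss_small = candidate_cross_entropy(logits, label, [0, 1])
```

Shrinking the candidate set removes terms from the softmax denominator, so the loss for the same target can only decrease. Changing which classes are candidates therefore changes which competitors the model is penalized against.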