MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models
Haoyang Li, Siyu Zhou, Liang Wang, Guodong Long
TL;DR
MAO tackles the inefficiency of existing prompt-tuning methods for Vision-Language Models (VLMs) with a model-agnostic optimization framework that leaves the backbone architecture unchanged. It augments prompt tuning with Data-Driven Enhancement (hard negative sampling for base tasks and rapid pseudo-labeling for new tasks), which improves the data distribution, and Alterable Regularization (dynamic cross-entropy over targeted candidate sets), which improves feature processing. The two-step, plug-and-play workflow improves base-to-new generalization and cross-dataset transfer while keeping computational cost comparable to that of standard prompt tuners. Overall, MAO offers a practical, scalable approach to efficient prompt tuning in VLMs, supported by experiments on 11 datasets and compatible with multiple backbones.
Abstract
Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., by adding loss calculations and meta-networks. These approaches generally increase complexity and training cost. To keep the tuning process efficient, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any component of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments demonstrate the strong performance and efficiency of MAO. The code of MAO is available at: https://github.com/JREion/M.A.O
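To make the idea of a cross-entropy computed over a targeted candidate set more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the candidate set consists of the ground-truth class plus the k highest-scoring other classes; this top-k rule, the function name candidate_cross_entropy, and the parameter k are illustrative assumptions, since the text above does not specify how MAO selects its candidates.

```python
import math


def candidate_cross_entropy(logits, target, k):
    """Cross-entropy restricted to the target class plus k hard negatives.

    Assumption (not from the paper): hard negatives are the k non-target
    classes with the highest logits.
    """
    # Rank non-target classes by logit, highest first.
    negatives = sorted(
        (i for i in range(len(logits)) if i != target),
        key=lambda i: logits[i],
        reverse=True,
    )[:k]
    candidates = [target] + negatives

    # Numerically stable log-sum-exp over the candidate logits only.
    peak = max(logits[i] for i in candidates)
    log_sum_exp = peak + math.log(
        sum(math.exp(logits[i] - peak) for i in candidates)
    )

    # Negative log-probability of the target within the candidate set.
    loss = log_sum_exp - logits[target]
    return loss, candidates


# Hypothetical similarity scores for one image against four class prompts.
loss, candidates = candidate_cross_entropy([2.0, 1.0, 0.5, -1.0], target=0, k=2)
```

In this sketch, setting k to the number of classes minus one recovers the standard cross-entropy over all classes, while a smaller k concentrates the loss on the classes most easily confused with the target.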
