FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation
Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu
TL;DR
FDBPL tackles the efficiency bottleneck of distillation-based prompt learning for Vision-Language Models by offline sharing of teacher supervision through a Region Information Lookup, enabling rapid retrieval of region-level soft labels during training. It introduces Region-Aware Dual Prompt (RADP) learning to separately align informative and non-informative regions with positive and negative prompts, and Prompt-Cascaded Difference (PCD) learning to capture intra-class and inter-class relationships via first- and second-order difference spaces. The approach preserves prompt-learning efficiency while achieving strong zero-shot generalization, demonstrated by substantial improvements across 11 datasets in base-to-new and cross-dataset evaluations and an average training speed-up of 2.2x. Its combination of offline supervision, region-aware prompting, and cascaded semantic differences offers scalable, parameter-efficient adaptation of CLIP-like VLMs to diverse downstream tasks.
Abstract
Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repeated online teacher inference sacrifices the inherent training efficiency of prompt learning. In this paper, we propose Faster Distillation-Based Prompt Learning (FDBPL), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that contain multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains the dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, while achieving $2.2\times$ faster training.
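The idea of sharing teacher supervision offline can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_teacher`, `build_region_lookup`, and `sample_region` are hypothetical names, the teacher is a deterministic stand-in for a large VLM, and the crop-sampling ranges are assumptions chosen only for illustration. The sketch shows soft labels for random crops being computed once and then retrieved by image ID during training, so the teacher is never called inside the training loop.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_teacher(image_id, box, num_classes):
    """Hypothetical stand-in for a teacher VLM: deterministic fake logits."""
    rng = random.Random(f"{image_id}:{box}")
    return [rng.uniform(-1.0, 1.0) for _ in range(num_classes)]


def build_region_lookup(image_ids, crops_per_image, num_classes, seed=0):
    """Offline pass: sample crops once and store the teacher's soft labels."""
    rng = random.Random(seed)
    lookup = {}
    for image_id in image_ids:
        entries = []
        for _ in range(crops_per_image):
            # Normalized crop box (x0, y0, x1, y1); ranges are assumed.
            x0, y0 = rng.uniform(0.0, 0.5), rng.uniform(0.0, 0.5)
            w, h = rng.uniform(0.2, 0.5), rng.uniform(0.2, 0.5)
            box = (round(x0, 4), round(y0, 4),
                   round(x0 + w, 4), round(y0 + h, 4))
            soft_label = softmax(toy_teacher(image_id, box, num_classes))
            entries.append((box, soft_label))
        lookup[image_id] = entries
    return lookup


def sample_region(lookup, image_id, rng):
    """Training-time retrieval: pick a stored crop and its soft label."""
    return rng.choice(lookup[image_id])


if __name__ == "__main__":
    table = build_region_lookup(["img0", "img1"], crops_per_image=3, num_classes=5)
    box, soft_label = sample_region(table, "img0", random.Random(1))
    print(box, [round(p, 3) for p in soft_label])
```

In a sketch like this, the student would crop the image at the retrieved box and be trained against the stored soft label. How the lookup is serialized and read efficiently (the "accelerated I/O" mentioned above) is not specified here and is left out.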
