FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu

TL;DR

FDBPL tackles the efficiency bottleneck of distillation-based prompt learning for Vision-Language Models by offline sharing of teacher supervision through a Region Information Lookup, enabling rapid retrieval of region-level soft labels during training. It introduces Region-Aware Dual Prompt (RADP) learning to separately align informative and non-informative regions with positive and negative prompts, and Prompt-Cascaded Difference (PCD) learning to capture intra-class and inter-class relationships via first- and second-order difference spaces. The approach preserves prompt-learning efficiency while achieving strong zero-shot generalization, demonstrated by substantial improvements across 11 datasets in base-to-new and cross-dataset evaluations and an average training speed-up of 2.2x. Its combination of offline supervision, region-aware prompting, and cascaded semantic differences offers scalable, parameter-efficient adaptation of CLIP-like VLMs to diverse downstream tasks.

Abstract

Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repeated online inference of the teacher model sacrifices the inherent training-efficiency advantage of prompt learning. In this paper, we propose Faster Distillation-Based Prompt Learning (FDBPL), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that contain multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains the dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, while achieving $2.2\times$ faster training.

Paper Structure

This paper contains 27 sections, 15 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Training Efficiency Advantage. The results are measured in minutes, and our method demonstrates significant improvements across all 11 datasets, as well as in average training time.
  • Figure 2: From PL to FDBPL (Ours). Prompt learning (PL) adapts CLIP to downstream tasks via learnable parameters. We compare three methods: (a) Native PL employs dataset hard labels, offering fast training but suffering from overfitting on seen classes, which degrades zero-shot performance on unseen classes. (b) Distillation-Based PL uses a teacher CLIP network to transfer generalization knowledge without specific labels, often on unlabeled data. However, the online inference required for soft label generation reduces its training efficiency. (c) Our proposed Faster Distillation-Based PL (FDBPL) achieves both high training efficiency and strong zero-shot generalization by pre-storing shareable soft labels across epochs and employing fast I/O.
  • Figure 3: Relationships between Components within FDBPL. The training paradigm based on random region images inevitably includes regions with insufficient information content (blue areas) alongside well-defined regions of interest (ROI) under sharp distributions (red areas). To address this challenge, we implement RADP (Region-Aware Dual Prompt) learning, which uses positive prompts for high-information regions and negative prompts for low-information regions; this dual-prompt method enables independent similarity training for distinct region types. Furthermore, we introduce PCD (Prompt-Cascaded Difference) learning, which uses positive-negative space contrastive analysis to capture both intra-class and inter-class latent relationships, thereby benefiting complex recognition scenarios.
  • Figure 4: FDBPL Framework. (a) To mitigate native prompt learning's strong dependency on hard-labeled downstream datasets, we introduce a larger teacher CLIP network that transfers generalized knowledge through unlabeled regional images. For efficient knowledge distillation, we pre-store the spatial coordinates of randomly cropped sub-regions, the data augmentation types, and the teacher-generated soft labels in a Region Information Lookup (RIL) Table, a design that eliminates redundant online inference by sharing soft labels across training epochs. Label sparsification is adopted to prevent I/O bottlenecks caused by excessive soft-label storage requirements. (b) Through direct retrieval of regional images and shared supervision signals from storage devices, we develop Region-Aware Dual-Prompt Learning (RADP) with learnable positive-negative prompts that independently align information-rich and information-poor regions. This dual-similarity mechanism enables the student CLIP model to simultaneously recognize correct semantic categories and reject uncertain regions. Specifically, within RADP, one path involves the student image encoder processing image regions that possess clear and distinguishable semantic content, as depicted in the pink input stream. Concurrently, another path involves the student image encoder receiving image regions that lack clear semantic content, such as background areas resulting from random cropping, object edge portions, or regions that are otherwise difficult to classify, as shown in the green input stream. Subsequently, a Prompt-Cascaded Difference (PCD) Learning module establishes cascaded difference spaces: first-order and second-order difference spaces that respectively capture intra-class variations and inter-class relationships, thereby enhancing zero-shot recognition capabilities in complex scenarios.
  • Figure 5: Label Sparsity Strategy. A large-capacity teacher network, such as CLIP trained on ImageNet-1K, produces a logit output with 1000 categories (a). Storing these complete logits as "Soft Label" in the Region Information Lookup (RIL) Table (b) incurs significant storage overhead, which substantially hinders the speed of subsequent knowledge distillation for soft supervision. To minimize storage consumption of the "Soft Label" field, we adopt two label sparsification strategies: Marginal Smoothing with Top-K (MS) (c) and Marginal Re-Norm with Top-K (MR) (d), where K represents the number of most salient categories retained. MS preserves the complete information of the top-K categories while averaging the remaining probabilities. MR, in contrast, re-normalizes the probabilities of the top-K categories and sets the probabilities of the remaining categories to zero.
  • ...and 7 more figures
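
The two label sparsification strategies named in the Figure 5 caption can be illustrated with a short sketch. This is not the authors' implementation: the function names (`top_k_indices`, `marginal_smoothing`, `marginal_renorm`) are hypothetical, the input is assumed to be a teacher probability vector that already sums to 1, and "averaging the remaining probabilities" is read here as spreading the leftover probability mass uniformly over the non-top-K categories.

```python
from typing import List


def top_k_indices(probs: List[float], k: int) -> List[int]:
    """Indices of the k largest probabilities (ties broken by lower index)."""
    return sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]


def marginal_smoothing(probs: List[float], k: int) -> List[float]:
    """MS (assumed reading): keep the top-k values unchanged and give every
    other category an equal share of the leftover probability mass."""
    keep = set(top_k_indices(probs, k))
    rest = len(probs) - len(keep)
    leftover = 1.0 - sum(probs[i] for i in keep)
    fill = leftover / rest if rest else 0.0
    return [probs[i] if i in keep else fill for i in range(len(probs))]


def marginal_renorm(probs: List[float], k: int) -> List[float]:
    """MR: rescale the top-k values so they sum to 1 and zero out the rest.
    Assumes k >= 1 and that the top-k mass is strictly positive."""
    keep = set(top_k_indices(probs, k))
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


if __name__ == "__main__":
    # Toy 5-class teacher output; a real teacher would have far more classes.
    teacher_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
    print("MS:", marginal_smoothing(teacher_probs, k=2))
    print("MR:", marginal_renorm(teacher_probs, k=2))
```

In both variants only the K retained (index, probability) pairs need to be stored per region, since the remaining entries are either a single shared value (MS) or zero (MR); that is the storage saving the caption refers to.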