PLPP: Prompt Learning with Perplexity Is Self-Distillation for Vision-Language Models

Biao Liu; Wenyi Fang; Xiaoyu Wu; Yang Zheng; Zheng Hu; Bo Yuan

TL;DR

PLPP addresses prompt overfitting in vision-language models with a perplexity-based regularizer, which the authors show is inherently a form of self-distillation. Labels come from the cosine similarity between the prompts and the embedding-layer weights, word probabilities come from an LM head that requires no training, and the loss uses soft labels restricted to the top-$k$ values. The encoders are left unmodified, and mutual self-distillation accelerates convergence. Across few-shot, base-to-novel, cross-dataset, and domain generalization benchmarks on 11 datasets, PLPP consistently improves over strong prompt-based baselines, with notable gains on datasets with a large domain shift such as EuroSAT. The method is a plug-in, computation-efficient way to improve prompt generalization in VL models such as CLIP.

Abstract

Pre-trained Vision-Language (VL) models such as CLIP have demonstrated excellent performance across numerous downstream tasks. A recent method, Context Optimization (CoOp), further improves the performance of VL models on downstream tasks by introducing prompt learning. CoOp optimizes a set of learnable vectors, known as the prompt, and freezes the whole CLIP model. However, relying solely on the CLIP loss to fine-tune prompts can yield models that are prone to overfitting on downstream tasks. To address this issue, we propose a plug-in prompt-regularization method called PLPP (Prompt Learning with PerPlexity), which uses a perplexity loss to regularize prompt learning. PLPP computes the perplexity of prompts in two steps: (a) calculating the cosine similarity between the weights of the embedding layer and the prompts to obtain labels, and (b) introducing a language model (LM) head that requires no training after the text encoder to output a word probability distribution. We also show that PLPP is inherently a form of self-distillation. To further prevent overfitting and to reduce the additional computation introduced by PLPP, we turn the hard labels into soft labels and select the top-$k$ values for calculating the perplexity loss. To accelerate model convergence, we introduce mutual self-distillation learning, that is, a perplexity loss and an inverted perplexity loss. Experiments conducted on four classification tasks indicate that PLPP exhibits superior performance compared to existing methods.
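
The two-step perplexity computation described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names (`hard_labels`, `tied_lm_head`, `perplexity`, `topk_soft_loss`) are hypothetical, the LM head is assumed to reuse the embedding matrix so that it needs no training, and the top-$k$ soft-label loss is assumed to renormalize both distributions over the $k$ most similar vocabulary entries. The abstract does not specify any of these details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hard_labels(prompts, embedding):
    """Step (a): label each prompt vector with its most similar vocabulary row."""
    return [
        max(range(len(embedding)), key=lambda j: cosine(p, embedding[j]))
        for p in prompts
    ]

def tied_lm_head(hidden, embedding):
    """Step (b), ASSUMPTION: a training-free head that reuses the embedding
    matrix, so the logits are dot products with every vocabulary row."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]

def perplexity(logits_per_position, labels):
    """exp of the mean negative log-likelihood of the labels."""
    nll = 0.0
    for logits, y in zip(logits_per_position, labels):
        nll -= math.log(softmax(logits)[y])
    return math.exp(nll / len(labels))

def topk_soft_loss(prompt, embedding, logits, k):
    """ASSUMPTION: soft-label cross-entropy over the k most similar vocabulary
    entries, with target and prediction renormalized on that subset."""
    sims = [cosine(prompt, row) for row in embedding]
    top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
    target = softmax([sims[j] for j in top])
    pred = softmax([logits[j] for j in top])
    return -sum(t * math.log(q) for t, q in zip(target, pred))

if __name__ == "__main__":
    # Toy vocabulary of three 2-d "word embeddings" and two prompt vectors.
    embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    prompts = [[0.9, 0.1], [0.1, 0.8]]
    # Stand-in for text-encoder outputs; here the prompts themselves.
    hidden_states = prompts
    labels = hard_labels(prompts, embedding)
    logits = [tied_lm_head(h, embedding) for h in hidden_states]
    print("labels:", labels)
    print("perplexity:", perplexity(logits, labels))
    print("top-2 soft loss:", topk_soft_loss(prompts[0], embedding, logits[0], 2))
```

Restricting the soft-label loss to $k$ entries means only $k$ probabilities per prompt position enter the loss rather than the full vocabulary, which is consistent with the abstract's stated goal of reducing the additional computation.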
