PRG: Prompt-Based Distillation Without Annotation via Proxy Relational Graph

Yijin Xu, Jialun Liu, Hualiang Wei, Wenhui Li

TL;DR

This paper tackles annotation-free distillation from large foundation models by addressing task-irrelevant knowledge and high feature density through Proxy Relational Graphs (PRG). PRG constructs relational graphs from both a CLIP-based teacher and a lightweight student, incorporating sample nodes (features+logits) and class proxy nodes, with edges defined by correlations to capture task-focused relationships. Distillation proceeds via node and edge alignment between teacher and student graphs, guided by a combined loss that includes cross-entropy supervision and PRG-specific terms. Empirical results across general, fine-grained, and large-scale datasets show that PRG outperforms traditional KD baselines, achieving competitive annotation-free performance and demonstrating the method’s robustness and scalability for transferring knowledge from LFMs to lightweight models.

Abstract

In this paper, we propose a new distillation method for extracting knowledge from Large Foundation Models (LFMs) into lightweight models, introducing a novel supervision mode that does not require manually annotated data. While LFMs exhibit exceptional zero-shot classification abilities across datasets, relying solely on LFM-generated embeddings for distillation poses two main challenges: the LFM's task-irrelevant knowledge and the high density of its features. Transferring task-irrelevant knowledge can compromise the student model's discriminative capabilities, and the high density of features within target domains obstructs the extraction of the discriminative knowledge essential for the task. To address these issues, we introduce the Proxy Relational Graph (PRG) method. We first extract task-relevant knowledge from the LFM by calculating a weighted average of the logits obtained through text prompt embeddings. We then construct sample-class proxy graphs for the LFM and the student model, respectively, to model the correlation between samples and class proxies. Finally, we distill selective knowledge by aligning the relational graphs produced by the LFM and the student model. Specifically, the distillation from the LFM to the student model is achieved through two types of alignment: 1) aligning the sample nodes produced by the student model with those produced by the LFM, and 2) aligning the edge relationships in the student model's graph with those in the LFM's graph. Our experimental results validate the effectiveness of PRG, demonstrating its ability to leverage the extensive knowledge base of LFMs while circumventing their inherent limitations in focused learning scenarios. Notably, in our annotation-free framework, PRG achieves an accuracy of 76.23% (teacher: 77.9%) on CIFAR-100 and 72.44% (teacher: 75.3%) on ImageNet-1K.
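The first step, a weighted average of prompt-specific logits, can be illustrated with a minimal NumPy sketch. It follows the description in the Figure 2 caption (each prompt's logits are weighted by their max logit and then averaged), but the details are assumptions: the function name `weighted_prompt_logits`, the array shapes, and the softmax normalisation of the per-prompt weights are hypothetical and not taken from the paper.

```python
import numpy as np

def weighted_prompt_logits(image_feats, text_weights):
    """Combine per-prompt zero-shot logits into one set of weighted logits W.

    image_feats : (N, D) image features I_f, assumed L2-normalised.
    text_weights: (M, C, D) text embeddings, one (C, D) classifier T_i
                  per prompt template, assumed L2-normalised.
    Returns     : (N, C) weighted logits.
    """
    # W_i = I_f @ T_i^T for every prompt template i -> shape (M, N, C)
    per_prompt = np.einsum("nd,mcd->mnc", image_feats, text_weights)
    # Confidence of prompt i on sample n is its maximum logit -> shape (M, N)
    conf = per_prompt.max(axis=2)
    # Assumption: turn confidences into weights that sum to 1 over prompts
    # with a softmax; the paper may normalise differently.
    conf = np.exp(conf - conf.max(axis=0, keepdims=True))
    weights = conf / conf.sum(axis=0, keepdims=True)
    # Weighted average over the prompt axis -> shape (N, C)
    return (weights[:, :, None] * per_prompt).sum(axis=0)

# Toy usage with random stand-ins for CLIP features.
rng = np.random.default_rng(0)
I_f = rng.normal(size=(4, 16))
I_f /= np.linalg.norm(I_f, axis=1, keepdims=True)
T = rng.normal(size=(3, 10, 16))  # 3 prompt templates, 10 classes
T /= np.linalg.norm(T, axis=2, keepdims=True)
W = weighted_prompt_logits(I_f, T)  # shape (4, 10)
```

Because the weights are non-negative and sum to one over prompts, each entry of `W` stays between the smallest and largest logit that any single prompt assigns to that sample and class.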

Paper Structure (21 sections, 15 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Using a Large Foundation Model (LFM) as a teacher for knowledge distillation in a dog classification task introduces two challenges: 1) The LFM captures irrelevant information such as grass and balls, which is not needed for fine-grained classification and should not be distilled. 2) The LFM's feature space is vast, so dog-specific features occupy only a small fraction of it, making it hard for the student to learn discriminative knowledge in such a dense space.
  • Figure 2: Overview of our method: Text prompts are encoded by the CLIP text encoder to generate classification weights $T$, and sample images are processed by the CLIP image encoder to obtain teacher image features $I_f$. Each $T_i$ is multiplied with $I_f$ to produce the logits $W_i$ for the corresponding prompt. The $W_i$ are weighted by their max logit and averaged to create the weighted logits $W$, which are concatenated with $I_f$ to form the teacher's sample node $F^t$. Simultaneously, the student model processes images to produce features $F_{\text{ori}}$ that match the dimensions of $I_f$ and classification logits $W^s$, which are then concatenated to form the student sample node $F^s$. Class proxy nodes $P^t$ and $P^s$ are updated by the sample nodes. Pearson correlation coefficients between sample and class proxy nodes form the relational graph edges $E^t$ and $E^s$. Finally, node and edge alignments transfer task-relevant knowledge to the student model, optimizing distillation effectiveness.
  • Figure 3: Heatmaps of inter-sample correlation on CIFAR100 [alex2009learning], featuring the student model RepViT M1.1 [wang2023repvit] (left) and the teacher model ViT-L/14 from CLIP [radford2021learning] (right).
  • Figure 4: Ablation study evaluating the effectiveness of node and edge alignment at different $\lambda_{\text{node}}$ and $\lambda_{\text{edge}}$ settings on CIFAR100 [alex2009learning]. The experiments involve RepViT M1.1 [wang2023repvit] and EfficientNet-B1 [tan2019efficientnet] as student models.
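
The graph construction and alignment described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the exponential-moving-average proxy update, and the use of mean squared error for both alignment terms are hypothetical choices. Only the concatenated sample nodes, the Pearson-correlation edges, and the weights $\lambda_{\text{node}}$ and $\lambda_{\text{edge}}$ come from the text, and the cross-entropy term mentioned in the TL;DR is omitted.

```python
import numpy as np

def pearson_edges(samples, proxies):
    """Edges E: Pearson correlation between each sample node and each proxy.

    samples: (N, K) sample nodes, proxies: (C, K) class proxy nodes.
    Returns an (N, C) matrix of correlation coefficients.
    """
    s = samples - samples.mean(axis=1, keepdims=True)
    p = proxies - proxies.mean(axis=1, keepdims=True)
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-12)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-12)
    return s @ p.T

def update_proxies(proxies, samples, labels, momentum=0.9):
    """Assumed update rule: move each class proxy toward the mean of the
    sample nodes currently assigned to that class (EMA). The paper's actual
    rule may differ."""
    new = proxies.copy()
    for c in np.unique(labels):
        class_mean = samples[labels == c].mean(axis=0)
        new[c] = momentum * proxies[c] + (1.0 - momentum) * class_mean
    return new

def prg_alignment_loss(f_t, w_t, f_s, w_s, p_t, p_s,
                       lam_node=1.0, lam_edge=1.0):
    """Node + edge alignment between teacher (t) and student (s) graphs.

    f_*: (N, D) features, w_*: (N, C) logits, p_*: (C, D + C) proxies.
    """
    # Sample nodes F^t and F^s: features concatenated with logits.
    node_t = np.concatenate([f_t, w_t], axis=1)
    node_s = np.concatenate([f_s, w_s], axis=1)
    # Assumption: both alignments use mean squared error.
    node_loss = np.mean((node_s - node_t) ** 2)
    edge_loss = np.mean(
        (pearson_edges(node_s, p_s) - pearson_edges(node_t, p_t)) ** 2
    )
    return lam_node * node_loss + lam_edge * edge_loss

# Toy usage: 6 samples, 8-dim features, 3 classes.
rng = np.random.default_rng(0)
f_t, w_t = rng.normal(size=(6, 8)), rng.normal(size=(6, 3))
f_s, w_s = rng.normal(size=(6, 8)), rng.normal(size=(6, 3))
p_t, p_s = rng.normal(size=(3, 11)), rng.normal(size=(3, 11))
labels = w_t.argmax(axis=1)  # pseudo-labels from teacher logits (assumption)
p_t = update_proxies(p_t, np.concatenate([f_t, w_t], axis=1), labels)
loss = prg_alignment_loss(f_t, w_t, f_s, w_s, p_t, p_s)
```

Setting `lam_node` or `lam_edge` to zero switches off the corresponding term, which mirrors the kind of ablation over $\lambda_{\text{node}}$ and $\lambda_{\text{edge}}$ reported in Figure 4.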