Relational Representation Distillation

Nikolaos Giakoumoglou, Tania Stathaki

TL;DR

This paper introduces Relational Representation Distillation (RRD), a knowledge distillation framework that preserves the relational structure of embeddings by aligning teacher and student similarity distributions over a memory bank, rather than enforcing strict instance-wise discrimination. RRD uses separate temperatures for the teacher and student, with a sharper student distribution, so that the student learns primary relationships precisely while secondary similarities are preserved. The objective connects to InfoNCE as a limiting case and to KL divergence as a complementary objective. The approach improves over traditional KD and prior relational methods on CIFAR-100 and ImageNet, improves the transferability of representations, and provides qualitative evidence from correlation analyses and t-SNE visualizations that the relational geometry of embeddings is better preserved. The improvements hold across tasks, datasets, and architectures, which supports modeling relative relationships in KD as a route to more effective compact models.

Abstract

Knowledge distillation involves transferring knowledge from large, cumbersome teacher models to more compact student models. The standard approach minimizes the Kullback-Leibler (KL) divergence between the probabilistic outputs of a teacher and student network. However, this approach fails to capture important structural relationships in the teacher's internal representations. Recent advances have turned to contrastive learning objectives, but these methods impose overly strict constraints through instance discrimination, forcing apart semantically similar samples even when they should remain similar. This motivates an alternative objective that preserves relative relationships between instances. Our method employs separate temperature parameters for teacher and student distributions, with sharper student outputs, enabling precise learning of primary relationships while preserving secondary similarities. We show theoretical connections between our objective and both InfoNCE loss and KL divergence. Experiments demonstrate that our method significantly outperforms existing knowledge distillation methods across diverse knowledge transfer tasks, achieves better alignment with teacher models, and sometimes even outperforms the teacher network.
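
The relational objective described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the temperature values (`tau_t=0.1`, `tau_s=0.05`), the use of one shared memory bank with teacher and student features of equal dimension, and the direction of the KL term (teacher as target) are all assumptions made for illustration. The only details taken from the text are that features are normalized, similarities are computed against a memory bank, the two distributions are aligned with a KL divergence, and the student temperature is the sharper (smaller) one.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarities(feature, bank):
    """Cosine similarity between one feature and every memory-bank entry."""
    f = l2_normalize(feature)
    return [sum(a * b for a, b in zip(f, l2_normalize(k))) for k in bank]


def relational_kl(teacher_feat, student_feat, bank, tau_t=0.1, tau_s=0.05):
    """KL(teacher || student) between similarity distributions over the bank.

    tau_s < tau_t makes the student distribution the sharper one
    (both values are illustrative assumptions, not the paper's settings).
    """
    p = softmax(similarities(teacher_feat, bank), tau_t)  # teacher target
    q = softmax(similarities(student_feat, bank), tau_s)  # student prediction
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    # Hypothetical toy memory bank of three 2-D embeddings.
    bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(relational_kl([1.0, 0.2], [0.2, 1.0], bank))
```

In this sketch the loss is zero when the student reproduces the teacher's similarity distribution over the bank and grows as the student's relations to the bank entries diverge from the teacher's. A practical version would operate on batches and update the bank with new embeddings at each step, which is omitted here.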

Paper Structure (31 sections, 10 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of the proposed relational representation distillation framework. Our method extracts normalized features from both teacher and student networks, computes similarity scores against a continuously updated memory bank, and aligns the student’s distribution to the teacher’s via KL divergence to effectively transfer relational knowledge.
  • Figure 2: Correlation matrices of the average difference between teacher and student logits on CIFAR-100 (lower is better). We use WRN-40-2 as the teacher and WRN-40-1 as the student. Methods have been re-implemented according to tian2022crd.
  • Figure 3: t-SNE visualizations of embeddings from teacher and student networks on CIFAR-100 (first 20 classes). We use WRN-40-2 as the teacher and WRN-40-1 as the student. Methods have been re-implemented according to tian2022crd.
  • Figure 4: Ablation study results on CIFAR-100 using WRN-40-2 as the teacher and WRN-16-2 as the student. We ablate (a) temperature parameters for teacher and student distributions, (b) memory bank size, and (c) weighting coefficient for the RRD loss. Curves are smoothed using Savitzky-Golay filtering for better visualization. Each experiment is run three times.