PLD: A Choice-Theoretic List-Wise Knowledge Distillation

Ejafa Bassam, Dawei Zhu, Kaigui Bian

TL;DR

This work reframes knowledge distillation as a ranking problem under the Plackett-Luce model and introduces Plackett-Luce Distillation (PLD), which uses a teacher-optimal permutation $\pi^*$ to impose a confidence-weighted, list-wise loss $\mathcal{L}_{\mathrm{PLD}}$. The first ranking step aligns with cross-entropy on the true label, while subsequent steps are guided by teacher logits; the weights $\alpha_k$, derived from the teacher's softmax, ensure a convex, translation-invariant objective that subsumes CE, ListMLE, and P-ListMLE. The authors show that PLD is convex with closed-form gradients and demonstrate consistent performance gains over KD, DIST, and related methods across CIFAR-100, ImageNet-1K, and MS-COCO, for both homogeneous and heterogeneous teacher-student pairs, and under extended training. The approach provides a unified, efficient KD framework that reduces hyperparameter sensitivity and extends naturally to dense prediction tasks, suggesting broad applicability for model compression without architectural changes.

Abstract

Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto approach to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it acts as an additional term to cross-entropy. This term has its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce "Plackett-Luce Distillation (PLD)", a weighted list-wise ranking loss. In PLD, the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single "teacher-optimal" ranking. The true label is placed first, followed by the remaining classes in descending teacher confidence. This process yields a convex and translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, across CIFAR-100, ImageNet-1K, and MS-COCO, PLD achieves consistent gains across diverse architectures and distillation objectives, including divergence-based, correlation-based, and feature-based methods, in both homogeneous and heterogeneous teacher-student pairs.
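The ranking construction described above can be sketched for a single example in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`teacher_optimal_permutation`, `pld_loss`) are hypothetical, the weight at rank $k$ is assumed to be the teacher's softmax probability of the class placed at that rank, and teacher temperature is omitted.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def teacher_optimal_permutation(teacher_logits, label):
    # True label first, then the remaining classes in descending teacher logit.
    rest = sorted(
        (i for i in range(len(teacher_logits)) if i != label),
        key=lambda i: teacher_logits[i],
        reverse=True,
    )
    return [label] + rest


def pld_loss(student_logits, teacher_logits, label):
    # Weighted Plackett-Luce negative log-likelihood of the teacher-optimal
    # ranking under the student's logits.
    pi = teacher_optimal_permutation(teacher_logits, label)
    q = softmax(teacher_logits)  # teacher confidences
    loss = 0.0
    for k, cls in enumerate(pi):
        # Step k: choose class pi[k] among the classes not yet ranked.
        remaining = [student_logits[j] for j in pi[k:]]
        step_nll = logsumexp(remaining) - student_logits[cls]
        # ASSUMPTION: alpha_k is the teacher softmax probability of pi[k].
        loss += q[cls] * step_nll
    return loss


student = [0.3, -1.2, 2.0, 0.1]
teacher = [2.0, 1.0, 3.0, 0.5]
print(teacher_optimal_permutation(teacher, label=1))
print(pld_loss(student, teacher, label=1))
```

In this sketch the first step ($k = 1$) is the cross-entropy on the true label scaled by the teacher's confidence in it, and adding a constant to every student logit leaves the loss unchanged, which matches the translation invariance claimed above.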

Paper Structure

This paper contains 40 sections, 28 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: (a) Varying the CE mixing weight $\alpha$ reveals that KD and DIST have different sensitivities: too much CE hurts both, while a sweet spot near $\alpha\approx0.1$ maximizes Top-1 accuracy. (b) Under extended training (100 vs. 300 epochs), PLD consistently outperforms both KD and DIST, demonstrating its sustained gains.
  • Figure 2: (a) Homogeneous setting: larger teachers and smaller students within the same architecture family. (b) Heterogeneous setting: a fixed ResNet-50 student distilled from diverse teacher architectures.
  • Figure 3: PLD loss surfaces at different teacher temperatures. (Top row) $T=2.0$ and $T=1.0$; (Bottom row) $T=0.5$ and $T=0.1$. Lowering $T$ below 1.0 flattens convexity.
  • Figure 4: Loss landscapes of three distillation methods: (a) DIST exhibits a sharp dip yet remains effectively planar; (b) KD shows moderate convexity; (c) PLD (ours) exhibits better convexity with contours mostly centered at the origin.