A Note on Knowledge Distillation Loss Function for Object Classification

Defang Chen

TL;DR

This note investigates the loss functions underlying knowledge distillation (KD) for object classification and clarifies their connections to logits matching and output regularization. It shows that the standard KD objective, in the infinite-temperature limit $\tau \to \infty$ and with equal-mean logits, is equivalent to a regularized logits-matching loss $L_{LM_r}$, revealing a direct link between KD and logits alignment. It also unifies label smoothing and confidence-penalty approaches under a skew-Jensen divergence framework and reframes KD, when combined with cross-entropy loss, as an adaptive label smoothing mechanism. Together, these results provide a theoretical grounding for different KD variants and a cohesive perspective on how teacher outputs regularize student predictions in object classification.
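The high-temperature claim can be checked numerically. The sketch below is illustrative only and is not taken from the note: it assumes the common convention of scaling the KL term by $\tau^2$, and it compares against a centered squared logit difference obtained from a second-order expansion of the softmax, which may differ from the exact form of $L_{LM_r}$. All function names and the example logits are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, tau):
    """tau^2 * KL(teacher || student) on temperature-softened outputs.

    The tau^2 factor is an assumed convention, not confirmed by the note.
    """
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return tau ** 2 * kl


def centered_logits_matching(teacher_logits, student_logits):
    """Squared logit difference after removing its mean, divided by 2K.

    With equal-mean logits the mean difference is zero, so this reduces to
    ||teacher - student||^2 / (2K).
    """
    k = len(teacher_logits)
    diff = [t - s for t, s in zip(teacher_logits, student_logits)]
    mean_diff = sum(diff) / k
    return sum((d - mean_diff) ** 2 for d in diff) / (2 * k)


# Hypothetical logits; both vectors have mean zero (the equal-mean case).
teacher = [2.0, -1.0, 0.5, -1.5]
student = [1.0, 0.0, -0.5, -0.5]

for tau in (1.0, 10.0, 100.0, 1000.0):
    print(f"tau={tau:g}  KD loss={kd_loss(teacher, student, tau):.6f}")
print(f"logits matching={centered_logits_matching(teacher, student):.6f}")
```

As $\tau$ grows, the printed KD loss should approach the logits-matching value, which is the behaviour described above for the equal-mean case.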

Abstract

This research note provides a quick introduction to the knowledge distillation loss function used in object classification. In particular, we discuss its connection to a previously proposed logits matching loss function. We further treat knowledge distillation as a specific form of output regularization and demonstrate its connection to label smoothing and entropy-based regularization.
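To make the output-regularization view concrete, here is a minimal sketch of one way KD combined with cross-entropy can be read as label smoothing with a teacher-dependent target. It assumes temperature 1 and a convex weighting $(1-\alpha, \alpha)$ between the two terms; the note's own weighting and notation may differ, and the probability vectors below are made-up examples.

```python
import math


def cross_entropy(target, pred):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def mix(a, b, alpha):
    """Convex combination (1 - alpha) * a + alpha * b."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


# Hypothetical example values, not taken from the note.
one_hot = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
teacher_probs = [0.7, 0.2, 0.07, 0.03]
student_probs = [0.5, 0.3, 0.1, 0.1]
alpha = 0.1

# Label smoothing: the one-hot label is mixed with a fixed uniform distribution.
label_smoothing = cross_entropy(mix(one_hot, uniform, alpha), student_probs)

# Cross-entropy plus KD at temperature 1 (assumed weighting).
ce_plus_kd = ((1 - alpha) * cross_entropy(one_hot, student_probs)
              + alpha * kl_divergence(teacher_probs, student_probs))

# Same quantity written as cross-entropy against a teacher-smoothed target,
# minus a term that does not depend on the student.
adaptive_ls = (cross_entropy(mix(one_hot, teacher_probs, alpha), student_probs)
               - alpha * entropy(teacher_probs))

print(f"label smoothing loss  = {label_smoothing:.6f}")
print(f"CE + KD loss          = {ce_plus_kd:.6f}")
print(f"adaptive-smoothing form = {adaptive_ls:.6f}")
```

The last two values coincide because $\mathrm{KL}(p \,\|\, q) = H(p, q) - H(p)$ and cross-entropy is linear in its target. Under these assumptions, the only difference from label smoothing is that the smoothing distribution is the teacher's per-example output rather than a fixed uniform one.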

Paper Structure

This paper contains 5 sections, 13 equations.

Theorems & Definitions (5)

  • Remark 1
  • Remark 2
  • proof
  • Remark 3
  • Remark 4