Confidence Preservation Property in Knowledge Distillation Abstractions

Dmitry Vengertsev; Elena Sherman

TL;DR

The paper investigates whether knowledge distillation, as implemented by TinyBERT, preserves the confidence (beyond accuracy) of a large teacher BERT model. It introduces a pairwise input-specific confidence measure, defines a global preservation criterion $\varphi_{cnf}$ via the statistic $\boldsymbol{\sigma}(X^{train}) \le \kappa$, and derives a theoretical bound linking $\varphi_{cnf}$ to distillation losses with $\kappa \le \gamma \sqrt{\beta / (|X^{train}|(1-\alpha))}$; with TinyBERT defaults this simplifies to $\boldsymbol{\sigma}(X^{train}) < \sqrt{\beta / |X^{train}|}$. Empirically, $\varphi_{cnf}$ is not uniformly preserved across six GLUE tasks: the 6-layer TinyBERT ($S_{6L}$) maintains the property for three tasks but not for the other three, while the 4-layer version ($S_{4L}$) fails entirely. By tuning distillation hyperparameters, especially for the prediction and intermediate layers, the authors show that $\varphi_{cnf}$ can be restored for the failed tasks without meaningful accuracy degradation, highlighting the practical value of considering confidence preservation in distillation design and tuning.
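To make the bound in the summary above concrete, the following is a minimal Python sketch that evaluates the right-hand side $\gamma \sqrt{\beta / (|X^{train}|(1-\alpha))}$ and checks a given statistic against a threshold. This summary does not define $\alpha$, $\beta$, or $\gamma$, so the code treats them as plain numeric inputs. The function names and the default values `gamma=1.0` and `alpha=0.0` are assumptions, chosen only because they reduce the general expression to the simplified form $\sqrt{\beta / |X^{train}|}$; they are not taken from the paper or from the TinyBERT code.

```python
import math


def confidence_bound(beta: float, n_train: int,
                     gamma: float = 1.0, alpha: float = 0.0) -> float:
    """Evaluate gamma * sqrt(beta / (n_train * (1 - alpha))).

    The defaults gamma=1 and alpha=0 are illustrative assumptions: they
    collapse the expression to sqrt(beta / n_train), the simplified form
    quoted in the summary.
    """
    if n_train <= 0:
        raise ValueError("n_train must be a positive dataset size")
    if alpha >= 1.0:
        raise ValueError("alpha must be below 1 so that 1 - alpha is positive")
    if beta < 0.0:
        raise ValueError("beta must be non-negative")
    return gamma * math.sqrt(beta / (n_train * (1.0 - alpha)))


def preserves_confidence(sigma: float, kappa: float) -> bool:
    """Check the criterion sigma(X_train) <= kappa for precomputed values."""
    return sigma <= kappa


if __name__ == "__main__":
    # Hypothetical numbers, not results from the paper.
    kappa_max = confidence_bound(beta=4.0, n_train=100)
    print(kappa_max, preserves_confidence(sigma=0.1, kappa=kappa_max))
```

Because the expression scales with $1/\sqrt{|X^{train}|}$, the threshold tightens as the training set grows, which is one reason the same distillation settings might satisfy the criterion on one task and not on another.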

Abstract

Social media platforms prevent malicious activities by detecting harmful content in posts and comments. To that end, they employ large-scale deep neural network language models for sentiment analysis and content understanding. Some models, like BERT, are complex and have numerous parameters, which makes them expensive to operate and maintain. To overcome these deficiencies, industry experts employ knowledge distillation, a compression technique in which a distilled model is trained to reproduce the classification behavior of the original model. The distillation process terminates when the distillation loss function reaches its stopping criterion. This function is mainly designed to ensure that the original and the distilled models exhibit similar classification behavior. However, besides classification accuracy, there are additional properties of the original model that the distilled model should preserve to be considered an appropriate abstraction. In this work, we explore whether distilled TinyBERT models preserve the confidence values of the original BERT models, and investigate how this confidence preservation property could guide the tuning of hyperparameters in the distillation process.

Paper Structure (18 sections, 11 equations, 2 figures, 3 tables)


Introduction
Background and Motivation
Significance of Knowledge Distillation Models
TinyBERT Distillation
Distillation as Implicit Abstraction
Confidence Property Preservation Criterion
Pairwise Confidence Preservation Property
Confidence Preservation Property $\bm{\varphi_{cnf}}$ Dependencies
Experiment Setup
GLUE Tasks Benchmarks
Model Settings
Parameters Selection
Experimental Evaluations and Results
Confidence Preservation Prevalence (RQ1)
Confidence Preservation Dependencies (RQ2)
...and 3 more sections

Figures (2)

Figure 1: Learning an abstract model via distillation. (a) End-to-end distillation flow; (b) task-specific distillation.
Figure 2: For two linguistic tasks, SST-2 and MRPC, the individual distributions of softmax confidence for the teacher and the student do not differ significantly under comparable expected calibration error (ECE). However, the distribution of the pairwise confidence does reveal poor distillation for the MRPC task.

