MiniDisc: Minimal Distillation Schedule for Language Model Compression

Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, Dawei Song

TL;DR

MiniDisc addresses the inefficiency of teacher assistant-based distillation by introducing a minimal, one-shot scheduling method. It defines a $\lambda$-tradeoff to capture the scale–performance balance and embeds candidate evaluation in a sandwich framework with gridding and pruning to generate and optimize candidates efficiently. Empirical results on GLUE and large LMs show that MiniDisc achieves competitive accuracy with significantly reduced search cost and scales to models with billions of parameters; analyses confirm the existence of a scale–performance tradeoff and the sufficiency of a single teacher assistant. The approach offers practical, scalable compression for resource-constrained deployment and provides a pathway to automatic optimization via the proposed approximations and potential residual distillation extensions.

Abstract

Recent studies have uncovered that language model distillation is less effective when facing a large capacity gap between the teacher and the student, and introduced teacher assistant-based distillation to bridge the gap. As the connection between the two, the teacher assistant's scale and performance are of vital importance in bringing the knowledge from the teacher to the student. However, existing teacher assistant-based methods require maximally many trials before scheduling an optimal teacher assistant. To this end, we propose a minimal distillation schedule (MiniDisc) for scheduling the optimal teacher assistant in minimally one trial. In particular, motivated by the finding that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant, MiniDisc is designed with a $\lambda$-tradeoff to measure the optimality of the teacher assistant without trial distillation to the student. MiniDisc can then schedule the optimal teacher assistant with the best $\lambda$-tradeoff in a sandwich framework. MiniDisc is evaluated with an extensive set of experiments on GLUE. Experimental results demonstrate the improved efficiency of MiniDisc compared to several state-of-the-art baselines. We further apply MiniDisc to a language model with billions of parameters and show its scalability.
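
The selection step described in the abstract can be illustrated with a small sketch. It scores each candidate teacher assistant by a tradeoff value and picks the best one, with no trial distillation to the student. The exact form of the $\lambda$-tradeoff is given in Definition 1 of the paper and is not reproduced on this page, so the linear form `performance + lam * sparsity` used here is an assumption for illustration only. The names `Candidate`, `tradeoff`, and `schedule_assistant`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate teacher assistant (hypothetical container)."""
    sparsity: float     # fraction of teacher parameters removed, in [0, 1)
    performance: float  # dev-set metric of the candidate itself, in [0, 1]


def tradeoff(c: Candidate, lam: float) -> float:
    # ASSUMPTION: a linear scale-performance tradeoff. The paper's
    # Definition 1 may use a different functional form.
    return c.performance + lam * c.sparsity


def schedule_assistant(candidates: Sequence[Candidate], lam: float) -> Candidate:
    """Return the candidate with the highest tradeoff value.

    Only the candidates' own scale and performance are used, so no
    distillation to the student is needed to rank them.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: tradeoff(c, lam))


# Made-up candidates on a sparsity grid (illustrative numbers only).
candidates = [
    Candidate(sparsity=0.2, performance=0.88),
    Candidate(sparsity=0.4, performance=0.87),
    Candidate(sparsity=0.6, performance=0.85),
    Candidate(sparsity=0.8, performance=0.78),
]

best = schedule_assistant(candidates, lam=0.2)
print(best)
```

Under this assumed form, a larger `lam` favors smaller teacher assistants, while `lam = 0` reduces to picking the best-performing candidate regardless of scale.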

Paper Structure (39 sections, 4 equations, 5 figures, 16 tables)


Figures (5)

  • Figure 1: The impact of teacher assistants of different scales and performance on the performance of students. In the study, a BERT base model is used as the teacher and distilled to a pruned student (10% of the teacher's parameters) via different teacher assistants (Mirzadeh et al., 2020) on MRPC and QQP. There are several observations: (1) The blue curve shows that the performance of the teacher assistant degrades as its scale decreases, as expected. (2) The green curve validates that the performance of the student varies with different teacher assistants. (3) The red curve represents the $\lambda$-tradeoff of the teacher assistant, which is positively correlated with the performance of the student.
  • Figure 2: An overview of MiniDisc by contrasting it to MaxiDisc, where one arrow denotes a distillation step. MiniDisc uses only one trial while MaxiDisc uses many trials to schedule the optimal teacher assistant.
  • Figure 3: Tradeoff studies by distilling the teacher to a student at 5% scale. On the left, the blue curve represents the performance of teacher assistants at different scales, the green curve represents the performance of MaxiDisc using these teacher assistants, the red curve represents the $\lambda$-tradeoff value, and the brown dashed line represents the performance of MiniDisc. On the right, the brown, orange, and purple bars represent the performance of MiniDisc using one, two, and three teacher assistants, respectively.
  • Figure 4: The distribution of example pruned structures. The structures are derived on the MRPC dataset.
  • Figure 5: Performance comparisons among various schedules for EncT5. The dots represent performance variations using either one or two teacher assistants for MaxiDisc. The triangles represent performance resulting from MiniDisc using one teacher assistant. The rectangles represent performance resulting from MiniDisc using two teacher assistants.

Theorems & Definitions (3)

  • Definition 1: $\lambda$-tradeoff
  • Remark 1
  • Remark 2