Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

Yao Fu, Yin Yu, Xiaotian Han, Runchao Li, Xianxuan Long, Haotian Yu, Pan Li

TL;DR

DynSDPB (dynamic SelfD from the previous mini-batch) is a model-agnostic and task-agnostic method in which each iteration distills from the logits generated in the previous iteration. As a novel fine-tuning policy, it integrates seamlessly with existing self-correction and self-training techniques for small language models (SLMs), because those techniques all require updating the SLMs' parameters.

Abstract

Knowledge distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprints. However, most KD pipelines require access to a complex teacher model, so the traditional KD procedure can be infeasible or prohibitively expensive, particularly when it relies on commercial LLMs such as GPT-4. Self-distillation (SelfD) is an attractive alternative that enables student models to learn without a teacher's guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications and assume the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous mini-batch (DynSDPB), in which each iteration distills from the logits generated in the previous iteration. Additionally, to address inaccurate predictions during early iterations, we dynamically adjust the distillation influence and temperature values to make fine-tuning more adaptive. Furthermore, DynSDPB is a novel fine-tuning policy that integrates seamlessly with existing self-correction and self-training techniques for small language models (SLMs), because they all require updating the SLMs' parameters. We demonstrate the superior performance of DynSDPB on both encoder-only LMs (e.g., the BERT model family) and decoder-only LMs (e.g., the LLaMA model family), validating its effectiveness across natural language understanding (NLU) and natural language generation (NLG) benchmarks.
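
The objective described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the loss is taken to be cross-entropy plus a weighted, temperature-softened KL term against the previous iteration's logits, the `tau**2` scaling follows the common KD convention, and `dynamic_alpha_tau` is a hypothetical placeholder rule (based on the entropy of the previous prediction) standing in for the paper's actual adjustment mechanism, which is not specified here.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Standard cross-entropy against a hard label."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distill_loss(curr_logits, prev_logits, label, alpha, tau):
    """CE on the current logits plus a KL term toward the previous logits.

    prev_logits are the logits the same model produced for this sample in
    the previous iteration; None means no earlier prediction exists yet.
    """
    ce = cross_entropy(curr_logits, label)
    if prev_logits is None:
        return ce
    teacher = softmax(prev_logits, tau)  # previous iteration acts as teacher
    student = softmax(curr_logits, tau)
    # Assumption: tau**2 scaling, as in conventional KD.
    return ce + alpha * tau ** 2 * kl_divergence(teacher, student)

def normalized_entropy(p):
    """Entropy of p divided by its maximum, giving a value in [0, 1]."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(len(p))

def dynamic_alpha_tau(prev_logits, alpha_max=1.0, tau_base=2.0):
    """Hypothetical rule: trust uncertain previous predictions less
    (smaller alpha) and soften them more (larger tau)."""
    u = normalized_entropy(softmax(prev_logits))
    alpha = max(0.0, alpha_max * (1.0 - u))
    tau = tau_base * (1.0 + u)
    return alpha, tau
```

In a training loop, one would cache each sample's logits after the forward pass and feed them back as `prev_logits` at the next iteration, calling `dynamic_alpha_tau` first to obtain the per-sample `alpha` and `tau`.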

Paper Structure

This paper contains 60 sections, 10 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Two types of distillation. (a) displays the classical knowledge distillation (KD) framework, which requires a teacher model. (b) outlines our dynamic SelfD from the previous mini-batch (DynSDPB), in which the student model distills knowledge from itself using the last iteration's information. Because the student evolves during distillation, we design a mechanism to dynamically adjust the temperature $\tau$ in the last-batch consistency loss and the balancing factor $\alpha$ in the overall SelfD loss. CE means cross-entropy, KLD means Kullback-Leibler divergence, and FC means fully connected layers.
  • Figure 2: Log-scale gradient norms of selected layers during DeBERTa-large fine-tuning under two schemes (vanilla fine-tuning and dynamic SelfD). The gradients of all parameters within a layer are averaged into a single scalar, which is tracked across fine-tuning iterations. With vanilla fine-tuning, the gradients of shallow layers vanish by the end of the process, whereas with dynamic SelfD the gradients remain robust throughout and continue to benefit fine-tuning.
  • Figure 3: The heatmap evaluation on hyperparameters (temperature $\tau$ and balancing factor $\alpha$) for static SelfD (Random DLB) for DeBERTa-v3-large on BoolQ and RTE.
  • Figure 4: Log-scale gradient norms of selected layers during DeBERTa-v3-large fine-tuning under two schemes (vanilla fine-tuning and dynamic SelfD). The gradients of all parameters within a layer are averaged into a single scalar, which is tracked across fine-tuning iterations. With vanilla fine-tuning, the gradients of shallow layers vanish by the end of the process, whereas with dynamic SelfD the gradients remain robust throughout and continue to benefit fine-tuning.
  • Figure 5: Log-scale gradient norms of selected layers during RoBERTa-base fine-tuning under two schemes (vanilla fine-tuning and dynamic SelfD). The gradients of all parameters within a layer are averaged into a single scalar, which is tracked across fine-tuning iterations. With vanilla fine-tuning, the gradients of shallow layers vanish by the end of the process, whereas with dynamic SelfD the gradients remain robust throughout and continue to benefit fine-tuning.
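
The per-layer statistic plotted in the gradient-norm figures could be computed roughly as follows. This is a sketch under assumptions: "averaged into a scalar" is read here as the mean absolute gradient over a layer's parameters, the logarithm is taken base 10, and the function names and the `eps` guard are illustrative rather than taken from the paper.

```python
import math

def layer_log_grad_norm(grads, eps=1e-12):
    """Average the absolute gradients of one layer, then take log10.

    grads is a flat list of that layer's parameter gradients; eps avoids
    log(0) once gradients have fully vanished.
    """
    mean_abs = sum(abs(g) for g in grads) / len(grads)
    return math.log10(mean_abs + eps)

def track(history, layer_grads):
    """Append one iteration's log-scale value per layer to the history."""
    for name, grads in layer_grads.items():
        history.setdefault(name, []).append(layer_log_grad_norm(grads))
    return history
```

Calling `track` once per fine-tuning iteration yields one curve per layer; a curve that keeps falling toward the `eps` floor corresponds to the vanishing gradients the captions describe for shallow layers.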