GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation

Mohsen Gholami, Mohammad Akbari, Cindy Hu, Vaden Masrani, Z. Jane Wang, Yong Zhang

TL;DR

GOLD tackles the bias of LLM-driven data generation in knowledge distillation by introducing an iterative, out-of-distribution (OOD) guided feedback loop and an energy-based OOD evaluation to identify informative, tail-end samples. By alternating between LLM-driven data generation and SLM training, and by selecting OOD samples via energy scores, GOLD enhances the generalization of distilled models across 10 NLP tasks, including novel ones like NL4OPT. The method combines a problem formulation that emphasizes stress-testing the SLM on challenging OOD data with a symmetric cross-entropy objective to mitigate label noise, achieving state-of-the-art or competitive results against prior data-generation approaches and LLM few-shot baselines. The work demonstrates practical effectiveness for task-agnostic KD and points to future extensions to other modalities and enhanced data valuation for reliability.
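
The two ingredients named above, energy-based OOD scoring and a symmetric cross-entropy objective, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the commonly cited forms (energy as the negative temperature-scaled log-sum-exp of the logits, and symmetric cross-entropy as a weighted sum of forward and reverse cross-entropy with log(0) clipped to a constant). The function names, the `temperature`, `alpha`, `beta`, and `log_zero` parameters, and the top-k selection rule are all hypothetical; the paper's exact scoring and selection criteria are not given in this summary.

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * logsumexp(logits / T); higher values look more OOD."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse

def select_ood(batch_logits, k):
    """Indices of the k highest-energy samples (hypothetical selection rule)."""
    energies = [energy_score(logits) for logits in batch_logits]
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return order[:k]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def symmetric_cross_entropy(logits, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """alpha * CE + beta * reverse CE for one sample with a hard (one-hot) label."""
    probs = softmax(logits)
    # Forward cross-entropy: -log p(label).
    ce = -math.log(max(probs[label], 1e-12))
    # Reverse cross-entropy: -sum_k p_k * log q_k with q one-hot.
    # log(1) = 0 for the labeled class; log(0) is clipped to log_zero elsewhere.
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce
```

In this sketch, `select_ood` would be applied to the SLM's logits on freshly generated samples to pick candidates for feedback, while `symmetric_cross_entropy` would replace plain cross-entropy during SLM training so that confidently contradicted (possibly mislabeled) samples contribute a bounded penalty.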

Abstract

Knowledge distillation from LLMs is essential for the efficient deployment of language models. Prior works have proposed generating data with LLMs to train distilled models. We argue that generating data with LLMs is prone to sampling mainly from the center of the original content distribution. This limitation hinders the distilled model from learning the true underlying data distribution and causes it to forget the tails of the distribution (samples with lower probability). To this end, we propose GOLD, a task-agnostic data generation and knowledge distillation framework that employs an iterative out-of-distribution-guided feedback mechanism for the LLM. As a result, the generated data improves the generalizability of distilled models. An energy-based OOD evaluation approach is also introduced to deal with noisy generated data. Our extensive experiments on 10 different classification and sequence-to-sequence tasks in NLP show that GOLD outperforms prior art and the LLM by an average of 5% and 14%, respectively. We also show that the proposed method is applicable to less explored and novel tasks. The code is available.

Paper Structure (20 sections, 8 equations, 4 figures, 16 tables)

Figures (4)

  • Figure 1: GOLD finds failure modes of SLM in the course of data generation and guides the LLM to generate OOD samples to improve SLM's generalizability.
  • Figure 2: Overview of the proposed data generation and knowledge distillation method, GOLD. $\oplus$: Concatenation.
  • Figure 3: The distribution of generated data by our method with and without OOD-based feedback.
  • Figure 4: The distribution of generated data by our method with and without OOD feedback.