Knowledge Distillation of Black-Box Large Language Models

Hongzhan Chen; Ruijun Chen; Yuqi Yi; Xiaojun Quan; Chenliang Li; Ming Yan; Ji Zhang

TL;DR

Proxy-KD is a novel method that uses a proxy model to transfer knowledge efficiently from black-box LLMs to smaller models; in the authors' experiments it surpasses traditional white-box KD techniques.

Abstract

Given the exceptional performance of proprietary large language models (LLMs) like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers. While leveraging the high-quality outputs of these teachers is advantageous, the inaccessibility of their internal states often limits effective knowledge transfer. To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models. Our experiments show that Proxy-KD not only enhances the performance of KD from black-box teacher models but also surpasses traditional white-box KD techniques. This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.

Paper Structure (27 sections, 11 equations, 7 figures, 3 tables)

Introduction
Related Work
White-Box Knowledge Distillation
Black-Box Knowledge Distillation
Connection with Teacher Assistant
Method
Problem Statement
Preliminary
Hard-Label Knowledge Distillation.
Soft-Label Knowledge Distillation.
Proxy Model Alignment
Knowledge Distillation
Experimental Setup
Models and Datasets
Training Corpus.
...and 12 more sections

Figures (7)

Figure 1: Comparison of white-box knowledge distillation (KD) and black-box knowledge distillation (KD).
Figure 2: Overview of our proposed Proxy-based Knowledge Distillation (Proxy-KD).
Figure 3: Performance of student models under different proxy models. We also show the ratio of performance gap between the proxy models and the student models.
Figure 4: The statistics of the cumulative probability within the Top K exceeding 0.95. The x-axis represents different values of K, while the y-axis shows the percentage of instances meeting this threshold.
Figure 5: The match ratio between the proxy's and the teacher's output tokens before and after alignment. If the proxy's top-1 token equals the teacher's token at a given step, that step counts as a match; otherwise it counts as a mismatch.
...and 2 more figures
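
The captions of Figures 4 and 5 each describe a simple statistic. The sketch below is one possible reading of those captions, not the authors' code: the function names `match_ratio` and `topk_coverage` are hypothetical, and the inputs (token-ID sequences and per-step probability lists) are assumed formats chosen for illustration.

```python
from typing import Sequence


def match_ratio(proxy_top1: Sequence[int], teacher_tokens: Sequence[int]) -> float:
    """Fraction of steps where the proxy's top-1 token equals the teacher's token.

    Follows the Figure 5 caption; both inputs are assumed to be aligned
    step by step (hypothetical input format).
    """
    if len(proxy_top1) != len(teacher_tokens):
        raise ValueError("sequences must have the same number of steps")
    if not teacher_tokens:
        return 0.0
    matches = sum(p == t for p, t in zip(proxy_top1, teacher_tokens))
    return matches / len(teacher_tokens)


def topk_coverage(
    distributions: Sequence[Sequence[float]], k: int, threshold: float = 0.95
) -> float:
    """Share of distributions whose k largest probabilities sum to more than threshold.

    Follows the Figure 4 caption; each inner sequence is assumed to be one
    probability distribution over the vocabulary (hypothetical input format).
    """
    if not distributions:
        return 0.0
    hits = 0
    for probs in distributions:
        top_k_mass = sum(sorted(probs, reverse=True)[:k])
        if top_k_mass > threshold:
            hits += 1
    return hits / len(distributions)


if __name__ == "__main__":
    # Toy data only; these numbers are not results from the paper.
    print(match_ratio([1, 2, 3, 4], [1, 2, 0, 4]))
    print(topk_coverage([[0.9, 0.08, 0.02], [0.5, 0.3, 0.2]], k=2))
```

Sweeping `k` over a range and plotting `topk_coverage` against it would give a curve of the kind Figure 4 describes, while calling `match_ratio` on proxy outputs before and after alignment would give the two quantities compared in Figure 5.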
