Knowledge Distillation of Black-Box Large Language Models

Hongzhan Chen, Ruijun Chen, Yuqi Yi, Xiaojun Quan, Chenliang Li, Ming Yan, Ji Zhang

TL;DR

Proxy-KD is a novel method that uses a proxy model to transfer knowledge efficiently from black-box LLMs to smaller models. In the authors' experiments, it surpasses traditional white-box KD techniques.

Abstract

Given the exceptional performance of proprietary large language models (LLMs) like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers. While leveraging the high-quality outputs of these teachers is advantageous, the inaccessibility of their internal states often limits effective knowledge transfer. To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models. Our experiments show that Proxy-KD not only enhances the performance of KD from black-box teacher models but also surpasses traditional white-box KD techniques. This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.

Paper Structure (27 sections, 11 equations, 7 figures, 3 tables)

This paper contains 27 sections, 11 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Comparison of white-box and black-box knowledge distillation (KD).
  • Figure 2: Overview of our proposed Proxy-based Knowledge Distillation (Proxy-KD).
  • Figure 3: Performance of student models under different proxy models. We also show the ratio of the performance gap between the proxy models and the student models.
  • Figure 4: The statistics of the cumulative probability within the Top K exceeding 0.95. The x-axis represents different values of K, while the y-axis shows the percentage of instances meeting this threshold.
  • Figure 5: The match ratio between the proxy's and the teacher's output tokens before and after alignment. If the top-1 token given by the proxy equals the token given by the teacher at the current step, it is considered a match; otherwise, it is considered a mismatch.
  • ...and 2 more figures
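
The two statistics described in the captions of Figures 4 and 5 can be sketched in a few lines of Python. This is only an illustration of the definitions as worded in the captions, not the authors' evaluation code: the function names, the token-ID inputs, and the use of plain lists of probabilities are all assumptions, and the captions do not say which model's distributions Figure 4 measures.

```python
from typing import Sequence


def match_ratio(proxy_top1: Sequence[int], teacher_tokens: Sequence[int]) -> float:
    """Fraction of steps where the proxy's top-1 token equals the teacher's token.

    Hypothetical helper following the Figure 5 caption: a step counts as a
    match only if the two token IDs are identical.
    """
    if len(proxy_top1) != len(teacher_tokens):
        raise ValueError("sequences must have the same length")
    if not teacher_tokens:
        return 0.0
    matches = sum(p == t for p, t in zip(proxy_top1, teacher_tokens))
    return matches / len(teacher_tokens)


def top_k_covers(probs: Sequence[float], k: int, threshold: float = 0.95) -> bool:
    """True if the k largest probabilities sum to more than `threshold`."""
    return sum(sorted(probs, reverse=True)[:k]) > threshold


def coverage_percentage(
    distributions: Sequence[Sequence[float]], k: int, threshold: float = 0.95
) -> float:
    """Percentage of distributions whose top-k cumulative probability exceeds
    `threshold` (the y-axis quantity described in the Figure 4 caption)."""
    if not distributions:
        return 0.0
    hits = sum(top_k_covers(d, k, threshold) for d in distributions)
    return 100.0 * hits / len(distributions)


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    print(match_ratio([1, 2, 3, 4], [1, 2, 0, 4]))
    print(coverage_percentage([[0.9, 0.06, 0.04], [0.5, 0.3, 0.2]], k=2))
```

Under this reading, the match ratio would be computed once with the proxy before alignment and once after, on the same teacher outputs, so the two numbers are directly comparable.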