Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher

Hyunjong Ok, Jegwang Ryu, Jaeho Lee

TL;DR

This work tackles decoding with limited teacher supervision through an adaptive, entropy-aware fusion of a teacher LLM and a faster student sLLM on the initial tokens, without further training. It formalizes a logit-aggregation scheme $S_{\alpha} = \sigma(f_{s}(x)) + \alpha (\sigma(f_{t}(x)) - \sigma(f_{s}(x)))$, in which $\alpha = 0$ recovers the student and $\alpha = 1$ recovers the teacher, and shows that the optimal $\alpha$ is highly task- and datum-dependent, challenging fixed-teacher strategies. The authors propose two components, an $\alpha$ predictor and entropy-based knowledge injection, which predict a per-datum $\alpha$ and decide when to inject teacher information based on the student’s entropy. Across classification and diverse LLM benchmarks, the method consistently improves over baselines and can approach or surpass teacher performance at lower computational cost, offering practical benefits for edge devices and remote LLM access. The work contributes a practical, interpretable framework for limited-supervision decoding and highlights the nuanced role of entropy in guiding knowledge injection.
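The aggregation rule $S_{\alpha}$ can be illustrated with a minimal sketch. It assumes that $\sigma$ is a softmax over the vocabulary and that $f_{s}(x)$ and $f_{t}(x)$ are the student and teacher logits for the next token. The function names, the three-token vocabulary, and the logit values are hypothetical and are not taken from the authors' code.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax (assumed here to be the sigma in S_alpha)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(student_logits: List[float],
              teacher_logits: List[float],
              alpha: float) -> List[float]:
    """S_alpha = p_s + alpha * (p_t - p_s), computed per vocabulary entry."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return [ps + alpha * (pt - ps) for ps, pt in zip(p_s, p_t)]


def greedy_token(scores: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical logits over a 3-token vocabulary.
student_logits = [2.0, 1.5, 0.1]
teacher_logits = [1.0, 2.5, 0.1]

for alpha in (0.0, 1.0, 2.0):
    scores = aggregate(student_logits, teacher_logits, alpha)
    print(alpha, greedy_token(scores), [round(s, 3) for s in scores])
```

With $\alpha > 1$ the scores extrapolate past the teacher: they still sum to one, but individual entries can become negative, so $S_{\alpha}$ is better read as a score vector for token selection than as a probability distribution.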

Abstract

How can small-scale LLMs (sLLMs) efficiently utilize the supervision of large language models (LLMs) to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving rise to many decoding algorithms that utilize supervision without further training. However, it is still unclear what an effective strategy is under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm to effectively aggregate the small-scale LLM and LLM predictions on initial tokens so that the generated tokens can more accurately condition the subsequent token generation by the small-scale LLM only. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the small-scale LLM. Through our experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.

Code: https://github.com/HJ-Ok/DecLimSup

Paper Structure

This paper contains 48 sections, 3 equations, 7 figures, 17 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our methodology. As the parameter $\alpha$ increases, the methodology leverages a more significant disparity in knowledge between the teacher and the student models. The example shows a modification in the initial generated word from 'John' to 'Let,' which allows the sentence to generate the correct answer when subsequently generated with the student model.
  • Figure 2: Accuracy as a function of $\alpha$. The red dashed line marks $\alpha = 1$, and the orange line shows the student model's baseline performance.
  • Figure 3: Number of correct answers obtained by "Receive knowledge from the teacher" and "Generate solo" as a function of student entropy, using Llama-2 on GSM8K. (a) shows low-entropy cases and (b) shows high-entropy cases. The red dashed line indicates the threshold beyond which "Generate solo" performs better.
  • Figure 4: Illustration of the optimal-$\alpha$ prediction module.
  • Figure 5: Comparison of our method with CoT-decoding using Phi-3. K denotes the number of explored paths, starting from the top-$k$ tokens.
  • ...and 2 more figures
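
The entropy-based decision described in the Figure 3 caption can be sketched as a simple gate. This is an illustration under stated assumptions, not the authors' implementation: entropy is taken to be the Shannon entropy of the student's softmax distribution, teacher knowledge is injected only when that entropy does not exceed a threshold (the direction suggested by the caption), and the threshold, $\alpha$, and logit values are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gated_scores(student_logits: List[float],
                 teacher_logits: List[float],
                 alpha: float,
                 entropy_threshold: float) -> List[float]:
    """Return student-only scores above the threshold, S_alpha otherwise."""
    p_s = softmax(student_logits)
    if entropy(p_s) > entropy_threshold:
        # High student entropy: "Generate solo" (assumed direction, per Figure 3).
        return p_s
    # Low student entropy: "Receive knowledge from the teacher".
    p_t = softmax(teacher_logits)
    return [ps + alpha * (pt - ps) for ps, pt in zip(p_s, p_t)]


# Hypothetical values chosen only to exercise both branches.
THRESHOLD = 0.5
teacher = [0.0, 3.0, 0.0]
confident_student = [4.0, 0.0, 0.0]   # low entropy, teacher is injected
uncertain_student = [0.0, 0.0, 0.0]   # uniform, entropy = ln(3), solo

print(gated_scores(confident_student, teacher, 1.5, THRESHOLD))
print(gated_scores(uncertain_student, teacher, 1.5, THRESHOLD))
```

In a limited-supervision setting the gate would also save a teacher call whenever the student generates solo, since the teacher logits are only needed in the injection branch.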