Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
Hyunjong Ok, Jegwang Ryu, Jaeho Lee
TL;DR
This work tackles decoding with limited teacher supervision by enabling an adaptive, entropy-aware fusion of a teacher LLM and a faster student sLLM on the initial tokens, without further training. It formalizes an aggregation scheme over the two models' predictions, $S_{\alpha} = \sigma(f_{s}(x)) + \alpha (\sigma(f_{t}(x)) - \sigma(f_{s}(x)))$, which reduces to the student prediction at $\alpha = 0$ and to the teacher prediction at $\alpha = 1$, and shows that the optimal $\alpha$ is highly task- and datum-dependent, challenging strategies that fix $\alpha$ in advance. The authors propose two components: an $\alpha$ predictor that estimates a per-datum $\alpha$, and entropy-based knowledge injection that decides when to inject teacher information based on the student's entropy. Across classification and diverse LLM benchmarks, the method consistently improves over baselines and can approach or surpass teacher performance at lower computational cost, offering practical benefits for edge devices and remote LLM access. The work contributes a practical, interpretable framework for decoding under limited supervision and highlights the nuanced role of entropy in guiding knowledge injection.
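The aggregation rule and the entropy-based decision can be illustrated with a minimal NumPy sketch. It assumes that $\sigma$ is the softmax and that $f_{s}(x)$ and $f_{t}(x)$ are the student and teacher logits for one token position. The function names and the rule "inject only when the student's entropy reaches a threshold" are hypothetical choices for illustration; the summary above does not specify the exact criterion, and this is not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    # Shannon entropy (in nats) of a probability vector.
    return float(-np.sum(p * np.log(p + eps)))

def aggregate(student_logits, teacher_logits, alpha):
    # S_alpha = sigma(f_s) + alpha * (sigma(f_t) - sigma(f_s)).
    # alpha = 0 keeps the student, alpha = 1 uses the teacher,
    # alpha > 1 extrapolates past the teacher ("overtrust").
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return p_s + alpha * (p_t - p_s)

def gated_aggregate(student_logits, teacher_logits, alpha, entropy_threshold):
    # Hypothetical gate (an assumption, not taken from the source):
    # consult the teacher only when the student is uncertain.
    p_s = softmax(student_logits)
    if entropy(p_s) < entropy_threshold:
        return p_s
    return aggregate(student_logits, teacher_logits, alpha)
```

Because both softmax outputs sum to one, $S_{\alpha}$ sums to one for any $\alpha$, although individual entries can fall outside $[0, 1]$ when $\alpha$ lies outside $[0, 1]$. That does not affect greedy selection with `argmax`.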
Abstract
How can small-scale large language models (sLLMs) efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving rise to many decoding algorithms that utilize supervision without further training. However, it is still unclear what an effective strategy is under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm that effectively aggregates the sLLM and LLM predictions on the initial tokens, so that the generated tokens can more accurately condition the subsequent token generation by the sLLM alone. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.
Code: https://github.com/HJ-Ok/DecLimSup
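To make the limited supervision setting concrete, the following toy sketch decodes greedily, consults the teacher only for the first `teacher_budget` tokens, and lets the student continue alone afterwards. Everything here is an illustrative assumption: `decode_limited`, `teacher_budget`, and the two toy models are hypothetical names, the models are plain functions from a token prefix to logits, and greedy decoding stands in for whatever decoding strategy is actually used.

```python
import numpy as np
from typing import Callable, List

LogitFn = Callable[[List[int]], np.ndarray]

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def decode_limited(student: LogitFn, teacher: LogitFn, prompt: List[int],
                   max_new_tokens: int, teacher_budget: int,
                   alpha: float) -> List[int]:
    # Greedy decoding; the teacher is queried only for the first
    # `teacher_budget` generated tokens (the limited supervision).
    tokens = list(prompt)
    for step in range(max_new_tokens):
        p = softmax(student(tokens))
        if step < teacher_budget:
            p_t = softmax(teacher(tokens))
            p = p + alpha * (p_t - p)  # aggregated prediction
        tokens.append(int(np.argmax(p)))
    return tokens[len(prompt):]

# Toy models over a 3-token vocabulary (hypothetical, for illustration).
def toy_student(tokens: List[int]) -> np.ndarray:
    # Prefers to repeat the last token it has seen.
    logits = np.zeros(3)
    logits[tokens[-1] if tokens else 0] = 2.0
    return logits

def toy_teacher(tokens: List[int]) -> np.ndarray:
    # Always prefers token 2.
    return np.array([0.0, 0.0, 3.0])

with_teacher = decode_limited(toy_student, toy_teacher, [0], 3, 1, 1.0)
student_only = decode_limited(toy_student, toy_teacher, [0], 3, 0, 1.0)
```

In this toy setup a single supervised token changes the whole continuation, because the student conditions on what was generated first. This mirrors the motivation for spending the teacher budget on the initial tokens.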
