Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference
Toby Simonds
TL;DR
The paper introduces Entropy Adaptive Decoding (EAD), a method that dynamically switches between a small and a large language model during inference based on prediction uncertainty, quantified by the rolling entropy $\bar{H}_t$. EAD computes $\bar{H}_t$ over a window of size $w$ and compares it with a threshold $\tau$: tokens are assigned to the smaller model while uncertainty stays low and to the larger model once it exceeds $\tau$, trading exact output fidelity for substantial compute savings. Empirical results on the MATH benchmark show strong efficiency gains across model families. The LLaMA pair retains 96.7% of the large model's performance while using the large model for only about 43% of tokens, and the Qwen pair retains 92.9% while using the large model for just 25% of tokens. Computational cost falls by 41.5% for the LLaMA pair and by 67% for the Qwen pair, whose size differential is larger. These findings suggest that adaptive, entropy-guided resource allocation can significantly reduce inference costs while keeping most of the benefits of large-scale models, which calls into question whether perfect output fidelity is necessary in many practical settings.
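The summary above names $\bar{H}_t$, $\tau$, and $w$ without spelling out how they combine. One plausible reading, given here as an assumption rather than the paper's stated definition, is that $\bar{H}_t$ is the mean per-token Shannon entropy over the most recent $w$ steps, where $p_t$ is the next-token distribution at step $t$ and $V$ is the vocabulary (both symbols introduced here for illustration):

$$
H_t = -\sum_{v \in V} p_t(v)\,\log p_t(v), \qquad
\bar{H}_t = \frac{1}{w} \sum_{i=t-w+1}^{t} H_i, \qquad
\text{model}_{t+1} =
\begin{cases}
\text{large} & \text{if } \bar{H}_t > \tau,\\
\text{small} & \text{otherwise.}
\end{cases}
$$

Under this reading, a larger $w$ smooths out isolated entropy spikes and so reduces how often the models switch, while a lower $\tau$ routes more tokens to the large model.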
Abstract
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit distributions, our method identifies text regions where a smaller model suffices and switches to a larger model only when prediction uncertainty exceeds a threshold. Unlike speculative decoding approaches that maintain perfect output fidelity through verification, EAD accepts controlled output divergence in exchange for computational efficiency. Our experiments on the MATH benchmark demonstrate remarkable efficiency gains across different model families. Using the LLaMA family, we maintain 96.7% of the 11B model's performance (50.4% vs 52.1%) while using it for only 43% of tokens, decreasing computational cost by 41.5%. These gains become more pronounced with larger size differentials in the Qwen family, where we achieve 92.9% of the 14B model's performance (74.3% vs 80.0%) while using it for just 25% of tokens, decreasing computational cost by 67%. The consistency of these results across model pairs suggests that language model computation can be significantly optimized by selectively deploying model capacity based on local generation complexity. Our findings indicate that current approaches to model inference may be unnecessarily conservative in their pursuit of perfect output fidelity, and that accepting minor performance trade-offs can enable dramatic reductions in computational costs.
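The switching rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `shannon_entropy` and `choose_models`, the use of a simple mean over the window, and the choice to average over fewer than `w` values at the start of generation are all assumptions made here for illustration. The abstract also does not say which model's distribution supplies the entropy, so the sketch takes a precomputed list of per-step entropies as input.

```python
import math
from collections import deque


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_models(entropies, tau, w):
    """Pick a model after each step from the rolling mean entropy.

    entropies: per-step entropies H_t (hypothetical input; in practice these
               would come from the active model's logits).
    tau:       entropy threshold.
    w:         window size; early steps average over the values seen so far.
    Returns a list with "large" where the rolling mean exceeds tau, else "small".
    """
    window = deque(maxlen=w)  # keeps only the most recent w entropies
    choices = []
    for h in entropies:
        window.append(h)
        rolling = sum(window) / len(window)
        choices.append("large" if rolling > tau else "small")
    return choices


# Hypothetical entropy trace: a confident stretch, an uncertain region, then recovery.
trace = [0.1, 0.2, 1.5, 2.0, 0.3, 0.1]
choices = choose_models(trace, tau=0.8, w=2)
large_fraction = choices.count("large") / len(choices)
```

With this made-up trace, the large model is selected for the three steps whose two-step rolling mean exceeds 0.8, so `large_fraction` is 0.5. The analogous quantity in the reported experiments is the share of tokens generated by the large model (43% for the LLaMA pair, 25% for the Qwen pair).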
