Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu
TL;DR
This work tackles the inference latency of large language models by introducing CAST, a cost-aware dynamic tree decoding framework that accounts for hardware and batching when constructing and pruning the speculative draft tree. Building on EAGLE-2/3, CAST models the trade-off between accepted-token throughput and inference cost, adapting tree depth and per-layer token counts to the current compute context. Experiments across six tasks and six models show CAST achieving speedups of up to 5.2x over conventional decoding and gains of generally 5–20% over existing speculative decoding methods, especially in larger-model and batching scenarios. The approach improves latency and throughput in real-world LLM deployments without modifying the underlying models or acceptance criteria.
Abstract
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding has emerged as a solution, enabling multiple tokens to be generated and validated simultaneously. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. We therefore introduce CAST, a new dynamic tree decoding approach that takes inference costs into account, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. In comprehensive experiments across six diverse tasks and six distinct LLMs, CAST achieves speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
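To make the cost-aware idea concrete, the following is a minimal illustrative sketch, not the CAST algorithm itself. It assumes each candidate draft node comes with an estimated probability that its whole path is accepted, and that one verification pass has a cost that depends on how many tokens are verified. The function names (`pick_draft_size`, `make_cost`), the cost model (flat until a hypothetical capacity is exceeded, then linear), and all numeric values are assumptions chosen for illustration only.

```python
from typing import Callable, Sequence


def pick_draft_size(
    accept_probs: Sequence[float],
    verify_cost: Callable[[int], float],
) -> int:
    """Pick how many draft nodes to verify so that the expected number of
    accepted tokens per unit of verification cost is maximized."""
    # Keep the most promising nodes first (highest path acceptance probability).
    ranked = sorted(accept_probs, reverse=True)
    # With no draft nodes, a verification pass still yields one token.
    expected = 1.0
    best_k, best_rate = 0, expected / verify_cost(0)
    for k, p in enumerate(ranked, start=1):
        expected += p  # each extra node adds its acceptance probability
        rate = expected / verify_cost(k)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k


def make_cost(batch_size: int, capacity: int = 64,
              base: float = 1.0, slope: float = 0.05) -> Callable[[int], float]:
    """Hypothetical cost model: verification time is flat while the total
    token count fits within `capacity`, then grows linearly."""
    def cost(k: int) -> float:
        tokens = batch_size * (k + 1)  # k draft nodes plus the root, per sequence
        return base + slope * max(0, tokens - capacity)
    return cost


# Hypothetical per-node path acceptance probabilities.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

small_batch = pick_draft_size(probs, make_cost(batch_size=1))   # 8
large_batch = pick_draft_size(probs, make_cost(batch_size=32))  # 1
```

Under these assumed numbers, a batch size of 1 leaves spare compute, so verifying all eight nodes is worthwhile, whereas a batch size of 32 saturates the assumed capacity and only one draft node pays for itself. A fixed tree size would be suboptimal in at least one of the two settings, which is the motivation for adapting the tree to the compute context.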
