Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

Yinrong Hong, Zhiquan Tan, Kai Hu

TL;DR

This work tackles the inference latency of large language models by introducing CAST, a cost-aware dynamic tree decoding framework that accounts for hardware and batching when constructing and pruning the speculative draft tree. Building on EAGLE-2/3, CAST models the trade-off between accepted-token throughput and inference cost, adapting tree depth and per-layer token counts to the current compute context. Experiments across six tasks and six models show CAST achieving speedups of up to 5.2x over conventional decoding and gains of generally 5–20% over existing speculative decoding methods, especially in larger-model and batching scenarios. The approach improves latency and throughput in real-world LLM deployments without modifying the underlying models or acceptance criteria.

Abstract

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. Speculative decoding has emerged as a solution, enabling multiple tokens to be generated and validated simultaneously. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. We therefore introduce CAST, a new dynamic tree decoding approach that takes inference costs into account, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. In comprehensive experiments across six diverse tasks and six distinct LLMs, CAST achieves speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.

Paper Structure

This paper contains 24 sections, 2 theorems, 3 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

The selection algorithm used by EAGLE-2 and EAGLE-3 in the $i$-th layer is a special case of the proposed selection algorithm, obtained by setting $c_j = \lambda j + \delta$ and $C_1 = \frac{\sum_{j=1}^{B} v^{(j)}_i(K)}{B \lambda}$.
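
To give a feel for why a cost term changes how many draft tokens are worth keeping, here is a toy Python sketch. It is not the paper's selection algorithm. It assumes that $c_j$ can be read as the cost of processing $j$ tokens in one forward pass, which this summary does not define. The gain values, the gain-per-cost objective, and the function names `linear_cost` and `best_token_count` are all hypothetical and chosen only for illustration.

```python
def linear_cost(j, lam, delta):
    """Assumed linear cost of processing j tokens: c_j = lam * j + delta."""
    return lam * j + delta


def best_token_count(cum_gain, lam, delta):
    """Return the 1-based token count j that maximizes cum_gain[j-1] / c_j.

    cum_gain[j-1] is a hypothetical expected number of accepted tokens
    when the top-j candidates are kept (illustrative values only).
    """
    best_j, best_ratio = 1, float("-inf")
    for j, gain in enumerate(cum_gain, start=1):
        ratio = gain / linear_cost(j, lam, delta)
        if ratio > best_ratio:
            best_j, best_ratio = j, ratio
    return best_j


# Hypothetical gains with diminishing returns as more candidates are kept.
cum_gain = [1.0, 1.8, 2.4, 2.8, 3.0, 3.1]

# Cheap per-token cost (e.g. an under-utilized GPU): keeping more tokens pays off.
print(best_token_count(cum_gain, lam=0.05, delta=1.0))  # 5

# Expensive per-token cost (e.g. a large batch): fewer tokens is better.
print(best_token_count(cum_gain, lam=0.5, delta=1.0))  # 3
```

In this toy setting, a larger per-token slope $\lambda$ shifts the best choice toward fewer tokens, which is the kind of hardware and batch-size dependence the abstract describes.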

Figures (5)

  • Figure 1: An illustrative example of the dynamic expansion stage. We use a batch size of 1 for simplicity; general cases are handled by averaging across the batch. Each node initially has 2 branches in the example.
  • Figure 2: An illustrative example for the dynamic reranking stage.
  • Figure 3: The correlation of accept length and cumulative probability.
  • Figure 4: The behavior of accept length and speedup ratio when varying the number of tokens to be verified by the target model using EAGLE-3.
  • Figure 5: The speedup under different batch sizes on HumanEval.

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem D.1
  • proof