TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference

Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung

TL;DR

The paper addresses the challenge of maintaining draft–target alignment for speculative decoding under non-stationary inference workloads. It introduces TIDE, a serving-engine-native framework that reuses the target model's intermediate hidden states, already produced during inference, as training signals, so draft adaptation adds no inference overhead. Key innovations include adaptive runtime control that enables speculation and training only when they are beneficial, plus decoupled inference and training mapped to heterogeneous GPUs. Empirical evaluation shows up to 1.15x higher throughput than static speculative decoding and up to 1.67x faster draft training than approaches that recompute training signals. Gains vary by workload, and the evaluation also demonstrates efficiency in heterogeneous deployments.

Abstract

Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of TIDE architecture and workflow.
  • Figure 2: Detailed TIDE architecture and workflow. The system monitors acceptance length to adaptively enable/disable speculative decoding and selectively trigger training signal collection based on workload changes.
  • Figure 3: TIDE's asynchronous adaptation pipeline. Hidden state extraction overlaps with GPU computation (top), and draft model training proceeds in parallel with inference serving (bottom), achieving zero-overhead continuous adaptation.
  • Figure 4: Ratio of verification latency $T(b(\gamma+1))$ with $\gamma=3$ candidate tokens to single-token decoding latency $T(b)$ across different batch sizes. If decoding were completely memory-bound, this ratio would be 1.0 (ideal case shown by dashed line).
  • Figure 5: Accept length evolution during draft model training across four datasets using gpt-oss-120b as the target model. Accept length measures the average number of tokens accepted per speculative decoding step. Each time step corresponds to 30 seconds of training time.
  • ...and 7 more figures
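
The quantities in the captions for Figures 4 and 5 can be combined into a rough break-even check for when speculation pays off. The sketch below is illustrative only and is not TIDE's actual controller: it assumes the accept length counts every token emitted per speculative step, that each of the $\gamma$ draft passes costs a fixed fraction of $T(b)$, and that a fixed threshold decides whether to speculate. The function names and the `draft_cost_ratio` and `margin` parameters are hypothetical.

```python
def estimated_speedup(accept_length, verify_ratio, draft_cost_ratio, gamma):
    """Rough throughput gain of speculative decoding over plain decoding.

    accept_length:    average tokens emitted per speculative step
                      (assumption: includes any token produced by verification)
    verify_ratio:     T(b(gamma+1)) / T(b), the ratio described for Figure 4
    draft_cost_ratio: cost of one draft forward pass relative to T(b)
                      (hypothetical parameter, not taken from the paper)
    gamma:            number of candidate tokens drafted per step
    """
    if accept_length <= 0 or verify_ratio <= 0:
        raise ValueError("accept_length and verify_ratio must be positive")
    if draft_cost_ratio < 0 or gamma < 0:
        raise ValueError("draft_cost_ratio and gamma must be non-negative")
    # Cost of one speculative step, measured in units of T(b):
    # one verification pass plus gamma draft passes.
    step_cost = verify_ratio + gamma * draft_cost_ratio
    # Plain decoding emits one token per T(b), so the gain is tokens per unit cost.
    return accept_length / step_cost


def should_speculate(accept_length, verify_ratio, draft_cost_ratio, gamma, margin=1.0):
    """Enable speculation only when the estimated gain exceeds `margin`."""
    return estimated_speedup(accept_length, verify_ratio, draft_cost_ratio, gamma) > margin


if __name__ == "__main__":
    # Memory-bound case: verification costs about the same as one decode step.
    print(should_speculate(accept_length=2.5, verify_ratio=1.0, draft_cost_ratio=0.05, gamma=3))
    # Compute-bound case with a poorly aligned draft: speculation may not pay off.
    print(should_speculate(accept_length=1.2, verify_ratio=1.5, draft_cost_ratio=0.1, gamma=3))
```

Under these assumptions, a verification ratio near 1.0 makes even a modest accept length worthwhile, while a higher ratio at large batch sizes raises the accept length needed to break even.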