Table of Contents
Fetching ...

TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

Jaber Jaber, Osama Jaber

Abstract

Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow-AI/TIDE

TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

Abstract

Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow-AI/TIDE
Paper Structure (27 sections, 2 equations, 2 figures, 5 tables, 1 algorithm)

This paper contains 27 sections, 2 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Tide system overview. Left: one-time calibration collects hidden states from a frozen model, computes per-token cosine similarity to the final layer, and trains a binary router MLP at each checkpoint. Right: at inference, the full forward pass runs (preserving the KV cache), then routers evaluate each checkpoint post-hoc and select the earliest converged layer for logit computation. Numbered steps show execution order.
  • Figure 2: Per-token early exit in a 32-layer model. Token "the" converges at layer 8 and skips the remaining 24 layers. Token "cat" exits at layer 16. Only "sat" requires full depth. Tide evaluates the routers $\phi_k$ post-hoc after the full forward pass completes.