Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

TL;DR

This work tackles the memory-footprint crisis in production diffusion LLM serving with dLLM-Serve, a serving system that co-designs memory budgeting, phase-aware scheduling, and head-centric sparsity to handle the resource oscillation between Refresh and Reuse phases. Logit-Aware Activation Budgeting bounds peak activation memory, Phase-Multiplexed Scheduling interleaves heavy Refresh work with light Reuse work, and Head-Centric Sparse KV Cache Management aligns per-head sparsity with physical storage. An offline memory profiler informs conservative yet efficient budgeting. Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by up to 1.81x on the RTX 4090 and up to 1.74x on the NVIDIA L40S, and reduces tail latency by nearly 4x under heavy contention. The authors present dLLM-Serve as the first blueprint for scalable production dLLM inference and release the code as open source.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61$\times$-1.81$\times$ on the consumer-grade RTX 4090 and 1.60$\times$-1.74$\times$ on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4$\times$ under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware. The code is available at https://github.com/chosen-ox/dLLM-Serve.
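
To make the "monolithic logit tensor" problem concrete, the sketch below estimates the memory held by a full logit tensor and shows how processing tokens in chunks lowers the peak. It is an illustration only, not the dLLM-Serve implementation: it assumes a denoising step materializes logits for every position, and the function names, the chunking strategy, and the numbers in the usage note (batch of 8, 2048 tokens, 128,000-entry vocabulary, 2 bytes per element) are hypothetical rather than taken from the paper.

```python
def logit_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold a [num_tokens, vocab_size] logit tensor."""
    return num_tokens * vocab_size * bytes_per_elem


def peak_logit_bytes(batch_size, seq_len, vocab_size,
                     chunk_tokens=None, bytes_per_elem=2):
    """Peak logit memory when at most `chunk_tokens` rows are live at once.

    chunk_tokens=None models the monolithic case, where logits for every
    position of every request are materialized together.
    """
    total_tokens = batch_size * seq_len
    live_tokens = total_tokens if chunk_tokens is None else min(chunk_tokens, total_tokens)
    return logit_bytes(live_tokens, vocab_size, bytes_per_elem)


def chunked_argmax(hidden, weight, chunk_tokens):
    """Greedy token ids computed chunk by chunk (pure-Python toy version).

    hidden: list of per-token hidden vectors.
    weight: list of per-vocabulary-entry vectors (hypothetical output head).
    Only `chunk_tokens` rows of logits exist at any time; each chunk is
    reduced to token ids before the next chunk is projected.
    """
    token_ids = []
    for start in range(0, len(hidden), chunk_tokens):
        chunk = hidden[start:start + chunk_tokens]
        logits = [[sum(h * w for h, w in zip(row, col)) for col in weight]
                  for row in chunk]
        token_ids.extend(max(range(len(r)), key=r.__getitem__) for r in logits)
    return token_ids
```

With the hypothetical numbers above, `peak_logit_bytes(8, 2048, 128_000)` gives 4,194,304,000 bytes (about 3.9 GiB), while `peak_logit_bytes(8, 2048, 128_000, chunk_tokens=1024)` gives 262,144,000 bytes (250 MiB). The chunked path produces the same token ids as the monolithic one because each row's argmax depends only on that row.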

Paper Structure

This paper contains 30 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of dLLM-Serve.
  • Figure 2: Standard profiling vs. logit-aware profiling.
  • Figure 3: Throughput scalability under increasing arrival rates on the RTX 4090. The plot compares the realized serving throughput (Y-axis) against the request arrival rate (X-axis) for LLaDA-8B.
  • Figure 4: Latency-load comparison of dLLM-Serve against three state-of-the-art baselines on the RTX 4090. We report the average end-to-end latency across LiveBench, Burst, and OSC datasets as request arrival rates increase.
  • Figure 5: Jitter & predictability under high load (RTX 4090, RPS=0.5). Bars show best-baseline-normalized gains (baseline $=1.0$, red dashed) for latency standard deviation $\sigma$ and tail span $(\max-\min)$ on LiveBench/Burst/OSC; higher is better (less jitter).
  • ...and 3 more figures
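
The jitter metrics named in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the caption does not say whether $\sigma$ is a population or sample standard deviation (population is assumed here), and the normalized gain is assumed to be the baseline value divided by the system value, so that values above 1.0 mean less jitter. The function names are hypothetical.

```python
import statistics


def jitter_metrics(latencies):
    """Latency standard deviation and tail span (max - min) for one run."""
    return {
        "sigma": statistics.pstdev(latencies),  # population std (assumption)
        "tail_span": max(latencies) - min(latencies),
    }


def normalized_gain(baseline_value, system_value):
    """Baseline-normalized gain: baseline = 1.0, higher means less jitter.

    Assumed definition: baseline metric divided by the system's metric.
    """
    if system_value <= 0:
        raise ValueError("system_value must be positive")
    return baseline_value / system_value


def jitter_gains(baseline_latencies, system_latencies):
    """Per-metric gains of a system over a baseline on the same workload."""
    base = jitter_metrics(baseline_latencies)
    ours = jitter_metrics(system_latencies)
    return {name: normalized_gain(base[name], ours[name]) for name in base}
```

For example, `jitter_gains([1, 2, 3, 4, 5], [2, 3, 4])` reports a tail-span gain of 2.0, since the baseline spans 4 time units and the system spans 2.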