Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving
Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos
TL;DR
This work addresses the memory-footprint crisis in production diffusion LLM serving with dLLM-Serve, a serving system that co-designs memory budgeting, phase-aware scheduling, and head-centric sparsity to cope with the resource oscillation between Refresh and Reuse phases. It combines three mechanisms: Logit-Aware Activation Budgeting, which bounds peak activation memory; Phase-Multiplexed Scheduling, which interleaves heavy Refresh work with light Reuse work; and Head-Centric Sparse KV Cache Management, which aligns per-head sparsity with physical storage. An offline memory profiler additionally informs conservative yet efficient budgeting. Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61x to 1.81x on the consumer-grade RTX 4090 and by 1.60x to 1.74x on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4x under heavy contention. The authors present dLLM-Serve as the first blueprint for scalable production dLLM inference and release the code as open source to encourage adoption and further research.
Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), using parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations and lacks a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61$\times$ to 1.81$\times$ on the consumer-grade RTX 4090 and by 1.60$\times$ to 1.74$\times$ on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4$\times$ under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware. The code is available at https://github.com/chosen-ox/dLLM-Serve.
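To make the "monolithic logit tensor" pressure concrete, the sketch below estimates the size of a dense logit tensor and shows how splitting it into token chunks keeps the transient peak under a fixed budget. It illustrates the general idea of decomposing a tensor peak and is not the dLLM-Serve implementation. The function names, the 128,000-entry vocabulary, the fp16 element size, and the 512 MiB budget are all illustrative assumptions, as is the premise that a decoding step produces logits for every position of every sequence in the batch.

```python
# Illustrative sketch only: names and numbers are assumptions, not dLLM-Serve's API.

def logit_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed for a dense [num_tokens, vocab_size] logit tensor."""
    return num_tokens * vocab_size * bytes_per_elem


def max_tokens_per_chunk(budget_bytes: int, vocab_size: int,
                         bytes_per_elem: int = 2) -> int:
    """Largest number of token rows whose logits fit inside the budget."""
    return budget_bytes // (vocab_size * bytes_per_elem)


def plan_chunks(num_tokens: int, budget_bytes: int, vocab_size: int,
                bytes_per_elem: int = 2) -> list[tuple[int, int]]:
    """Split [0, num_tokens) into half-open ranges that each respect the budget."""
    per_chunk = max_tokens_per_chunk(budget_bytes, vocab_size, bytes_per_elem)
    if per_chunk == 0:
        raise ValueError("budget is smaller than the logits of a single token")
    return [(start, min(start + per_chunk, num_tokens))
            for start in range(0, num_tokens, per_chunk)]


# Hypothetical workload: 8 sequences of 2048 tokens, 128,000-entry vocabulary, fp16.
NUM_TOKENS = 8 * 2048
VOCAB_SIZE = 128_000
BUDGET_BYTES = 512 * 2**20  # 512 MiB activation budget for logits (assumed)

full_size = logit_bytes(NUM_TOKENS, VOCAB_SIZE)
chunks = plan_chunks(NUM_TOKENS, BUDGET_BYTES, VOCAB_SIZE)
peak = max(logit_bytes(end - start, VOCAB_SIZE) for start, end in chunks)

print(f"monolithic logits: {full_size / 2**30:.2f} GiB")
print(f"chunks: {len(chunks)}, peak per chunk: {peak / 2**20:.1f} MiB")
```

Under these assumed numbers the monolithic tensor occupies about 3.91 GiB, while the chunked plan processes the same tokens in 8 passes that each stay below the 512 MiB budget. The trade-off is extra kernel launches in exchange for a bounded peak, which is why a budget of this kind would plausibly be chosen from profiling rather than fixed by hand.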
