Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone
Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Torsten Scholak
TL;DR
DiffuApriel addresses the inference inefficiency of diffusion language models built on Transformer backbones by replacing attention with bidirectional state-space recurrences, yielding a memory-efficient denoiser with linear-time inference. The authors introduce DiffuApriel and DiffuApriel-H, both designed for masked discrete diffusion, and show that they match or surpass Transformer-based diffusion models in quality while delivering up to 4.4x higher inference throughput at model sizes up to 1.3B parameters. The work also demonstrates that a hybrid architecture can combine the strengths of state-space models and attention to improve generalization, with notable perplexity improvements and competitive zero-shot performance. This suggests a practical path to fast, scalable, and memory-efficient diffusion-based text generation for long-context inference.
Abstract
Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.
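The idea of a bidirectional, linear-time sequence mixer can be illustrated with a toy example. The sketch below is not the DiffuApriel or Mamba implementation: it assumes a fixed scalar decay (Mamba uses input-dependent, selective parameters), and it assumes the two directions are combined by summation, which the abstract does not specify. The names `linear_scan` and `bidirectional_scan` are hypothetical. It only shows that one left-to-right pass plus one right-to-left pass lets every position see the whole sequence at a cost that grows linearly with length.

```python
import numpy as np

def linear_scan(x, decay):
    """Causal recurrence h_t = decay * h_{t-1} + x_t, one step per position."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay_fwd, decay_bwd):
    """Assumed combination: sum of a forward scan and a backward scan."""
    fwd = linear_scan(x, decay_fwd)
    # Run the same recurrence over the reversed sequence, then flip it back.
    bwd = linear_scan(x[::-1], decay_bwd)[::-1]
    return fwd + bwd

# Toy input: a single impulse in the middle of a length-5 sequence.
x = np.zeros((5, 2))
x[2] = 1.0

causal = linear_scan(x, 0.5)             # only positions >= 2 see the impulse
both = bidirectional_scan(x, 0.5, 0.5)   # every position sees the impulse
```

In the causal scan the impulse influences only later positions, whereas the bidirectional output is nonzero everywhere, which is the property a denoiser for masked tokens needs. The impulse position itself is counted once by each direction in this simplified sum.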
