Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone

Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Torsten Scholak

TL;DR

DiffuApriel addresses the inefficiency of diffusion language models that rely on Transformer backbones by replacing attention with bidirectional state-space recurrences, yielding a memory-efficient denoiser with linear-time inference. The authors introduce DiffuApriel and DiffuApriel-H, both designed for masked discrete diffusion, and show that they match or surpass Transformer-based diffusion models in quality while delivering throughput gains of up to 4.4x at scales up to 1.3B parameters. The work also demonstrates that a hybrid architecture can combine the strengths of state-space models and attention to improve generalization, with notable perplexity improvements and competitive zero-shot performance. This suggests a practical path to scalable, fast, and memory-efficient diffusion-based text generation for long-context inference.

Abstract

Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.

Paper Structure

This paper contains 14 sections, 9 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Schematic diagram of our proposed DiffuApriel architecture, where mixer blocks replace attention layers with bidirectional Mamba layers. In our experiments, to maintain comparability with DiffuTran, we treat the MLP layer as optional and refer to this variant as DiffuApriel+MLP. For DiffuApriel-H, we interleave an attention layer after every $K$ Mamba layers. Attention provides global token interactions while Mamba enables efficient state-space sequence modeling, allowing the hybrid denoiser to capture both long-range dependencies and local temporal dynamics with significantly improved efficiency. In our experiments, we fix $K=5$.
  • Figure 2: Inference throughput and latency per forward pass vs. sequence length, with a batch size of 1 and a constant 128 decoding steps. At the 1.3B scale, DiffuApriel+MLP and DiffuApriel-H+MLP yield 4.4$\times$ and 2.6$\times$ throughput improvements over DiffuTran, respectively. Furthermore, DiffuTran with KV caching (block size = 32) [fastdllm] boosts throughput for sequences up to 2048 tokens but degrades beyond 16K, eventually underperforming DiffuTran, consistent with [peng2025efficient].
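
The interleaving rule in the Figure 1 caption (an attention layer after every $K$ Mamba layers, with $K=5$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hybrid_layer_schedule`, the string labels, and the choice to count only Mamba layers toward $K$ are assumptions made for illustration, and the caption does not say how a trailing group of fewer than $K$ Mamba layers is handled.

```python
from typing import List


def hybrid_layer_schedule(num_mamba_layers: int, k: int = 5) -> List[str]:
    """Build a layer-type list with one attention layer after every k Mamba layers.

    Hypothetical helper: the name, labels, and counting convention are
    assumptions, not taken from the paper's code.
    """
    if num_mamba_layers < 0:
        raise ValueError("num_mamba_layers must be non-negative")
    if k <= 0:
        raise ValueError("k must be positive")

    schedule: List[str] = []
    for index in range(1, num_mamba_layers + 1):
        schedule.append("mamba")
        # Insert attention only once a full group of k Mamba layers is complete.
        if index % k == 0:
            schedule.append("attention")
    return schedule


if __name__ == "__main__":
    # Ten Mamba layers with k=5 give two groups, each followed by attention.
    print(hybrid_layer_schedule(10, k=5))
```

Under these assumptions, a stack with 10 Mamba layers contains 2 attention layers, so most of the depth stays in the linear-time mixer while attention is applied only occasionally.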