Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
TL;DR
The paper investigates diffusion-based large language models (DLLMs) for automatic speech recognition (ASR) along two routes: as external deliberation modules that refine transcripts from a base ASR system (Whisper-LLaMA), and as internal diffusion-based decoders for end-to-end ASR. The authors introduce Whisper-LLaDA, which couples a Whisper encoder with the diffusion-based LLaDA-8B-Instruct decoder and is fine-tuned efficiently with LoRA, so that a single audio-conditioned model supports both deliberation and decoding. Deliberation with audio-conditioned LLaDA yields consistent WER improvements, with random masking at high ratios and semi-autoregressive sub-block strategies providing the strongest gains. As a standalone decoder, diffusion-based decoding offers faster inference with a modest accuracy loss, while semi-autoregressive decoding can achieve the best overall results on LibriSpeech. These results highlight both the potential and the current limitations of diffusion-based models in ASR, and suggest that scaling data and developing better masking strategies could narrow the gap to state-of-the-art autoregressive systems while preserving efficiency.
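The masking strategies named above can be illustrated with a minimal sketch. This is not the authors' implementation: the `MASK` placeholder, the function names, the word-level tokens, and the per-token confidence scores are all hypothetical, and the semi-autoregressive part shows only the left-to-right split into sub-blocks, not the denoising model itself.

```python
import random

MASK = "<mask>"  # hypothetical placeholder, not LLaDA's actual mask token


def random_mask(tokens, ratio, rng):
    """Replace a randomly chosen fraction `ratio` of tokens with MASK."""
    n = round(len(tokens) * ratio)
    chosen = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]


def low_confidence_mask(tokens, confidences, ratio):
    """Replace the fraction `ratio` of tokens with the lowest confidence."""
    n = round(len(tokens) * ratio)
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    chosen = set(order[:n])
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]


def sub_blocks(tokens, block_size):
    """Split a sequence into consecutive sub-blocks, in left-to-right order."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


# Toy first-pass transcript with made-up confidence scores.
hypothesis = "the cat sat on the mat".split()
scores = [0.9, 0.2, 0.8, 0.1, 0.95, 0.7]

masked_random = random_mask(hypothesis, 0.5, random.Random(0))
masked_low_conf = low_confidence_mask(hypothesis, scores, 1 / 3)
blocks = sub_blocks(hypothesis, 4)
```

In a deliberation setup of this kind, the masked transcript would be passed to the denoising model together with the audio features, and the model would fill in the masked positions; that step is outside the scope of this sketch.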
Abstract
Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
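To make the relative-improvement figure concrete, the sketch below applies the usual definition of relative WER reduction, (baseline - new) / baseline, to the test-other numbers quoted above. The function names are illustrative, and the baseline value is back-computed from the rounded figures in the abstract rather than quoted from the paper, so it should be read as approximate.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, relative_reduction):
    """Invert the definition above to recover the baseline WER."""
    return new_wer / (1.0 - relative_reduction)


# Figures from the abstract: 4.94% WER on test-other, 12.3% relative improvement.
baseline = implied_baseline(4.94, 0.123)
check = relative_wer_reduction(baseline, 4.94)
print(f"implied baseline WER: {baseline:.2f}%")
print(f"relative reduction:   {check:.1%}")
```

Under these rounded inputs the implied Whisper-LLaMA baseline on test-other is roughly 5.6% WER, which corresponds to an absolute reduction of about 0.7 percentage points.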
