Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

TL;DR

The paper investigates diffusion-based large language models (DLLMs) for automatic speech recognition (ASR) along two routes: as external deliberation modules that refine transcripts from a base ASR system (Whisper-LLaMA), and as internal diffusion-based decoders for end-to-end ASR. The authors introduce Whisper-LLaDA, which couples a Whisper encoder with a diffusion-based LLaDA-8B-Instruct decoder and is fine-tuned efficiently with LoRA, so that a single audio-conditioned model supports both deliberation and decoding. Deliberation-based processing with audio-conditioned LLaDA yields consistent WER improvements, with random masking at high ratios and semi-autoregressive sub-block strategies providing the strongest gains. Diffusion-based decoding offers faster inference with a modest accuracy loss, while semi-autoregressive decoding can achieve the best overall results on LibriSpeech. These results highlight both the potential and the current limitations of diffusion-based models in ASR, and suggest that scaling data and developing better masking strategies could narrow the gap to state-of-the-art autoregressive systems while preserving efficiency.

Abstract

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
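The relative-improvement figure in the abstract can be related to absolute WER with the usual relative-reduction formula. The short Python sketch below is only a sanity check: the function names are illustrative, and the baseline test-other WER is not given in the abstract, so the value printed is back-computed from the two rounded numbers that are given (4.94% and 12.3%) and is therefore approximate.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, relative_reduction: float) -> float:
    """Invert the formula above to recover the baseline WER."""
    return new_wer / (1.0 - relative_reduction)


# Figures from the abstract: 4.94% WER on test-other, 12.3% relative improvement.
# The baseline below is an inference from rounded values, not a reported result.
implied = implied_baseline_wer(4.94, 0.123)
print(f"implied baseline test-other WER: about {implied:.2f}%")
```

Under these assumptions the implied Whisper-LLaMA baseline on test-other is a little above 5.6% WER.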

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: A flowchart illustrating the proposed ASR system with audio-conditioned LLaDA diffusion LLM used for deliberation processing.
  • Figure 2: Overview of decoding and deliberation-based processing strategies. (a) Diffusion-based decoding: generate the full response in parallel by iterative denoising. (b) Semi-autoregressive decoding: split the response into sub-blocks, apply diffusion within each, and proceed autoregressively across sub-blocks. (c) Diffusion-based deliberation: refine Whisper-LLaMA transcripts by randomly masking or masking low-confidence tokens, and then reconstructing them through diffusion. (d) Semi-autoregressive deliberation: refine transcripts in sub-blocks, combining diffusion within each sub-block and autoregression across sub-blocks.
  • Figure 3: Effect of the number of denoising steps and the number of sub-blocks on WER for (a) test-clean and (b) test-other.
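
The masking and sub-block strategies named in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `MASK` symbol, the function names, the word-level tokens, and the `denoise` callable (a stand-in for the audio-conditioned diffusion model) are all hypothetical, and each sub-block is fully re-masked here purely for simplicity.

```python
import random
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical placeholder, not the actual LLaDA mask token


def random_mask(tokens: Sequence[str], ratio: float, rng: random.Random) -> List[str]:
    """Replace a randomly chosen fraction `ratio` of the positions with MASK."""
    n_mask = round(len(tokens) * ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def low_confidence_mask(
    tokens: Sequence[str], confidences: Sequence[float], ratio: float
) -> List[str]:
    """Mask the fraction `ratio` of positions with the lowest confidence scores."""
    n_mask = round(len(tokens) * ratio)
    lowest = sorted(range(len(tokens)), key=lambda i: confidences[i])[:n_mask]
    positions = set(lowest)
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def split_sub_blocks(length: int, num_blocks: int) -> List[Tuple[int, int]]:
    """Split range(length) into contiguous (start, end) sub-blocks, left to right."""
    base, extra = divmod(length, num_blocks)
    bounds, start = [], 0
    for b in range(num_blocks):
        end = start + base + (1 if b < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def semi_autoregressive_refine(
    tokens: Sequence[str],
    num_blocks: int,
    denoise: Callable[[List[str]], List[str]],
) -> List[str]:
    """Visit sub-blocks left to right; mask one block and let `denoise` refill it.

    `denoise` is assumed to return a full-length sequence with every MASK
    replaced. In a real system it would be conditioned on the audio.
    """
    out = list(tokens)
    for start, end in split_sub_blocks(len(out), num_blocks):
        masked = out[:start] + [MASK] * (end - start) + out[end:]
        out = denoise(masked)
    return out


# Toy usage: the "denoiser" simply copies from a known reference transcript.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sad on the mat".split()


def toy_denoise(masked: List[str]) -> List[str]:
    return [reference[i] if t == MASK else t for i, t in enumerate(masked)]


refined = semi_autoregressive_refine(hypothesis, num_blocks=2, denoise=toy_denoise)
```

In this toy setting the refinement trivially recovers the reference, since the denoiser has oracle access to it. The point is only to show how the mask patterns and the left-to-right sub-block schedule differ.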