Drax: Speech Recognition with Discrete Flow Matching

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya

TL;DR

Drax introduces a discrete flow matching framework for non-autoregressive ASR built on a tri-mixture probability path that includes an audio-conditioned middle distribution to better align training and inference. The approach combines a Whisper encoder with a DiT decoder and a trainable middle component, is trained with Gumbel-Softmax and a combined loss, and enables parallel, controllable decoding with candidate scoring and speculative decoding. Theoretical analysis links generalization to occupancy divergences between training and inference, motivating the path design. Empirically, Drax achieves accuracy competitive with state-of-the-art baselines while offering favorable accuracy-efficiency trade-offs across multilingual benchmarks and demonstrating strong performance with speculative decoding.

Abstract

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct transitions from random noise to the target. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
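
To make the idea of a three-component probability path concrete, the following is a minimal sketch of how a noisy training sequence could be sampled token by token from a mixture of a uniform source, a middle distribution, and the target. It is not the authors' implementation. The scheduler `tri_mixture_weights` (binomial weights in `t`), the function `sample_xt`, and the `middle_probs` array are all hypothetical names and choices made for illustration; in Drax the middle distribution is audio-conditioned and trainable, whereas here it is just an array passed in by the caller.

```python
import numpy as np


def tri_mixture_weights(t):
    """Hypothetical scheduler (an assumption, not the paper's): returns the
    (source, middle, target) mixture weights at time t in [0, 1].
    The weights are non-negative and sum to 1; t=0 is pure source and
    t=1 is pure target."""
    return np.array([(1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2])


def sample_xt(target, middle_probs, t, vocab_size, rng):
    """Sample a noisy sequence x_t position by position from the tri-mixture.

    target:       (L,) int array of ground-truth token ids.
    middle_probs: (L, V) array, one categorical distribution per position.
                  Stands in for the audio-conditioned middle distribution.
    """
    length = target.shape[0]
    weights = tri_mixture_weights(t)

    # For every position, pick which of the three components it is drawn from.
    component = rng.choice(3, size=length, p=weights)

    # Component 0: uniform noise over the vocabulary.
    uniform_tokens = rng.integers(0, vocab_size, size=length)

    # Component 1: per-position categorical sample via the inverse CDF.
    cdf = np.cumsum(middle_probs, axis=1)
    u = rng.random((length, 1))
    middle_tokens = np.minimum((cdf < u).sum(axis=1), vocab_size - 1)

    # Component 2: the target token itself.
    return np.where(
        component == 0,
        uniform_tokens,
        np.where(component == 1, middle_tokens, target),
    )
```

With this assumed scheduler, `sample_xt(target, middle_probs, 1.0, vocab_size, rng)` returns the target unchanged, `t = 0.0` yields pure uniform noise, and intermediate values of `t` mix in tokens from the middle distribution, which is where plausible but imperfect hypotheses would enter the training path.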

Paper Structure

This paper contains 51 sections, 3 theorems, 67 equations, 9 figures, 11 tables.

Key Result

Corollary 1

For almost every $t\in[0,1]$:

Figures (9)

  • Figure 1: The Drax framework: (a) During training, our probability path involves a mixture of three components: a source uniform distribution, the target data distribution, and an audio conditioned distribution. (b) At inference, generation starts from noise tokens and iteratively follows the learned flow to the target sequence, passing through plausible intermediate hypotheses. (c) Drax combines an audio encoder with a DiT-based decoder.
  • Figure 2: Accuracy-efficiency trade-off: (a) The RTFx as a function of sequence length. (b) The Pareto front of the WER and RTF (%) (i.e., 100/RTFx). The Drax variants provide a favorable accuracy-efficiency trade-off with better control over the trade-off point.
  • Figure 3: Training path design. Comparison of training curves under different paths.
  • Figure 4: Tri-mixture sampling scheduler.
  • Figure 5: Runtime comparison for Drax-flash.
  • ...and 4 more figures

Theorems & Definitions (8)

  • Claim 1: TV stability of path marginals
  • Corollary 1: Instantaneous TV growth
  • Theorem 1: Generalization bound via occupancy TV
  • proof
  • proof
  • Proposition 1: From path-marginal TV to occupancy TV
  • proof
  • proof