Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR

Jaeyoung Lee, Masato Mimura

TL;DR

This paper unifies speech and text in a single decoder-only ASR model by adding a modality-aware sparse mixture of experts (MoE) to a decoder-only Conformer. The architecture partitions experts into separate speech and text pools with hard, top-1 routing, and uses hybrid-causality blocks that are bidirectional over speech and causal over text, all without external encoders or pretrained LLMs. Training combines CTC loss on speech positions with label-smoothed cross-entropy on text, plus a load-balancing term to stabilize routing. On LibriSpeech, the 113M-parameter model improves WER over a 139M-parameter AED baseline (2.8% vs. 3.2% on test-clean; 5.6% vs. 6.0% on test-other), and on Common Voice 16.1 a single multilingual model covering five languages lowers average WER from 12.2% to 10.6%. The approach is relevant to parameter-efficient, streaming-capable, decoder-only ASR systems and opens the way to further work on learned routing and cross-modality sharing.

Abstract

We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M-parameter attention-based encoder-decoder (AED) baseline on LibriSpeech (2.8% vs. 3.2% on test-clean; 5.6% vs. 6.0% on test-other). On Common Voice 16.1, with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR model that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment or adaptation modules.
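
To make the routing described in the abstract concrete, here is a minimal sketch in plain Python of hard, modality-aware dispatch followed by top-1 selection within a pool. It is an illustration under stated assumptions, not the authors' implementation: the names `top1_route` and `modality_aware_dispatch`, the linear dot-product router, and the pool sizes (3 speech experts, 2 text experts, dimension 4) are all hypothetical and not taken from the paper.

```python
import random


def top1_route(token, router_weights):
    """Return the index of the highest-scoring expert for one token (top-1)."""
    # Assumption: the router is a plain linear scorer, one weight row per expert.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def modality_aware_dispatch(tokens, is_speech, routers, pool_offsets):
    """Assign each token to exactly one global expert id.

    Hard routing: the modality flag alone selects the expert pool.
    Top-1: that pool's router then picks a single expert inside the pool.
    """
    assignments = []
    for token, speech in zip(tokens, is_speech):
        pool = "speech" if speech else "text"
        local_idx = top1_route(token, routers[pool])
        assignments.append(pool_offsets[pool] + local_idx)
    return assignments


# Hypothetical sizes chosen for illustration only.
DIM, N_SPEECH_EXPERTS, N_TEXT_EXPERTS = 4, 3, 2
rng = random.Random(0)
routers = {
    "speech": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_SPEECH_EXPERTS)],
    "text": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TEXT_EXPERTS)],
}
# Global expert ids: 0-2 are speech experts, 3-4 are text experts.
pool_offsets = {"speech": 0, "text": N_SPEECH_EXPERTS}

tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
is_speech = [True, True, True, True, False, False]  # speech frames, then text tokens
assignments = modality_aware_dispatch(tokens, is_speech, routers, pool_offsets)
print(assignments)
```

Because the pool is chosen by the modality flag rather than by a learned gate, a speech frame can never reach a text expert, and each token activates only one expert, which is where the "fewer active parameters" property comes from.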

Paper Structure (23 sections, 15 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Overview of the proposed decoder-only Conformer with modality-aware MoE. Expert pools are partitioned by modality, where speech tokens route only to speech experts and text tokens to text experts. Each router chooses a single expert (top-1) within the corresponding pool.
  • Figure 2: Non-causal and causal masks are applied to speech and text representations, respectively, across all convolution and self-attention layers.
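
The masking in Figure 2 can be sketched for the self-attention case as follows. This is a hedged illustration, not the paper's code: it assumes the input is a block of speech positions followed by text positions, that speech queries see all speech keys but no text keys, and that text queries see all speech keys plus text keys up to their own position. The function name `hybrid_causality_mask` is hypothetical, and the analogous masking of convolution layers is not shown.

```python
def hybrid_causality_mask(num_speech, num_text):
    """Build a boolean attention mask: mask[q][k] is True if query q may attend to key k.

    Assumed layout: positions [0, num_speech) are speech, the rest are text.
    """
    total = num_speech + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_speech:
                # Speech keys are visible to every query: non-causal within
                # speech, and a fully visible prefix for text queries.
                mask[q][k] = True
            elif q >= num_speech:
                # Text keys are visible only to text queries at or after them (causal).
                mask[q][k] = k <= q
    return mask


mask = hybrid_causality_mask(num_speech=3, num_text=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# Prints:
# 11100
# 11100
# 11100
# 11110
# 11111
```

The top-left block is fully visible (bidirectional speech), while the bottom-right block is lower triangular (autoregressive text).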