
Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition

Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei

TL;DR

The paper addresses robust end-to-end multi-accent ASR without relying on pre-defined accent labels. It introduces Qifusion-Net, a Conformer-based encoder augmented with a layer-adapted fusion (LAF) module and a cross-attention fusion that injects frame-level accent cues, plus an Accent Identify Decoder that provides auxiliary supervision in a multi-task framework. By combining CTC and attention-based ASR losses with an accent identification (AID) loss, and by employing dynamic chunk masking, the approach supports both streaming and non-streaming decoding. Empirically, it achieves relative CER reductions of $22.1\%$ on KeSpeech and $17.2\%$ on MagicData-RAMC over baselines while delivering strong AID performance, indicating practical viability for real-time, multi-accent ASR without accent priors.
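The chunk masking idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 50/50 choice between full and chunked context, and the `max_chunk` default are all assumptions made for illustration. The mask lets each frame attend to all frames up to the end of its own chunk; a chunk covering the whole utterance gives the non-streaming case.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """Build a num_frames x num_frames boolean mask.

    mask[i][j] is True when frame i may attend to frame j.
    A non-positive or oversized chunk_size means full attention
    (the non-streaming case).
    """
    if chunk_size <= 0 or chunk_size >= num_frames:
        return [[True] * num_frames for _ in range(num_frames)]
    mask = []
    for i in range(num_frames):
        # Frame i sees all past frames plus the rest of its own chunk.
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, max_chunk=25, rng=random):
    """Hypothetical sampling rule for training (an assumption, not from the paper).

    Half of the time use the full utterance, otherwise a random small chunk,
    so that one model is exposed to both decoding modes.
    """
    if rng.random() < 0.5:
        return num_frames
    return rng.randint(1, max_chunk)
```

At inference time, a fixed small `chunk_size` would correspond to streaming decoding, and `chunk_size >= num_frames` to non-streaming decoding.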

Abstract

End-to-end (E2E) speech recognition methods have achieved promising performance. However, automatic speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on a dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic features, facilitating fine-grained information fusion. Experimental results demonstrate that our proposed method outperforms the baseline, with relative reductions of $22.1\%$ and $17.2\%$ in character error rate (CER) on the multi-accent test sets of KeSpeech and MagicData-RAMC, respectively.
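For reference, a relative CER reduction is conventionally defined as below. The abstract does not spell out the formula, so this is the standard definition, assumed here rather than taken from the paper; the subscripts are illustrative names.

$$
\text{relative CER reduction} = \frac{\mathrm{CER}_{\text{baseline}} - \mathrm{CER}_{\text{proposed}}}{\mathrm{CER}_{\text{baseline}}} \times 100\%
$$

Under this definition, a $22.1\%$ relative reduction means the proposed model's CER is $77.9\%$ of the baseline CER, not that the absolute CER dropped by 22.1 points.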

Paper Structure (16 sections, 12 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Schematic architecture of the proposed layer-adapted fusion model for end-to-end multi-accent ASR.
  • Figure 2: Key parts of the model architecture.