FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

Kaituo Xu; Yan Jia; Kai Huang; Junjie Chen; Wenpeng Li; Kun Liu; Feng-Long Xie; Xu Tang; Yao Hu

TL;DR

Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage, and the model weights and code are released at https://github.com/FireRedTeam/FireRedASR2S.

Abstract

We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks:

FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR.

FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.

FireRedLID: An Encoder-Decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain.

FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%).

To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
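For readers unfamiliar with the headline ASR metric, the following is a minimal sketch of how a character error rate (CER) and an average over several benchmarks could be computed. The function names are illustrative and not taken from the FireRedASR2S code. The unweighted mean across benchmarks is an assumption, since the abstract does not state how the average is formed, and any text normalization applied before scoring is omitted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def average_cer(per_benchmark_cer):
    """Unweighted mean of per-benchmark CERs (an assumption, see the lead-in)."""
    return sum(per_benchmark_cer) / len(per_benchmark_cer)
```

For example, `cer("今天天气很好", "今天天汽很好")` counts one substituted character out of six reference characters.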

Paper Structure (29 sections, 2 figures, 8 tables)

This paper contains 29 sections, 2 figures, 8 tables.

Introduction
FireRedASR2S: System Overview
FireRedASR2: Automatic Speech Recognition
FireRedASR2-AED: Attention-based Encoder-Decoder ASR model
Confidence estimation from decoder probabilities
Post-hoc CTC branch for timestamps
FireRedASR2-LLM: Encoder-Adapter-LLM-based ASR model
Summary of differences from FireRedASR
FireRedVAD: Voice Activity Detection
Tasks and label definitions
Training data
Model architecture
Post-processing and segmentation
FireRedLID: Hierarchical Spoken Language and Dialect Identification
Model and training
...and 14 more sections

Figures (2)

Figure 1: Overview of FireRedASR2S. The input waveform is processed sequentially by FireRedVAD (VAD), FireRedLID (LID), FireRedASR2 (ASR), and FireRedPunc (Punc) to produce structured transcription outputs, including punctuated text, timestamps, confidence scores, and language labels.
Figure 2: Architecture of FireRedASR2-AED (bottom left), FireRedASR2-LLM (right), and Adapter.
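The sequential flow described in the Figure 1 caption can be sketched as a simple composition of four stages. This is a hypothetical illustration only: `Segment`, `run_pipeline`, and the callable signatures are invented here and are not the API of the released code. Confidence scores are left out for brevity, and whether the LID result conditions the ASR stage is not stated in the source, so the sketch keeps the two stages independent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    start: float    # segment start time in seconds
    end: float      # segment end time in seconds
    language: str   # language label from the LID stage
    text: str       # punctuated transcript from the ASR and Punc stages


def run_pipeline(
    waveform: Sequence[float],
    vad: Callable[[Sequence[float]], List[Tuple[float, float]]],
    lid: Callable[[Sequence[float], float, float], str],
    asr: Callable[[Sequence[float], float, float], str],
    punc: Callable[[str], str],
) -> List[Segment]:
    """Run VAD, then LID, ASR, and punctuation on each detected speech segment."""
    results = []
    for start, end in vad(waveform):          # 1. find speech regions
        language = lid(waveform, start, end)  # 2. label the spoken language
        raw_text = asr(waveform, start, end)  # 3. transcribe the region
        results.append(Segment(start, end, language, punc(raw_text)))  # 4. punctuate
    return results
```

Each stage is passed in as a plain callable, so stub functions can stand in for the real models when testing the control flow.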
