Fun-ASR Technical Report

Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Ying Liu, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Haoxu Wang, Wen Wang, Wupeng Wang, Yuzhong Wu, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou, Yanqiao Zhu

TL;DR

Fun-ASR is a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance in real-world deployment. The approach emphasizes production-oriented optimizations, including streaming capability, noise robustness, code-switching, hotword customization, and hallucination mitigation, and is validated on both open benchmarks and industry datasets. Key innovations include a four-component architecture, Best-RQ and AED pre-training, contextual supervised fine-tuning, a GRPO-based RL framework (FunRL), and RAG-based hotword customization. The results show performance gains in streaming, noisy, multilingual, and domain-specific scenarios, underscoring Fun-ASR’s practical value as a deployable, high-accuracy ASR system.

Abstract

In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings. The code and models are accessible at https://github.com/FunAudioLLM/Fun-ASR.

Paper Structure

This paper contains 34 sections, 3 equations, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Performance comparison (in accuracy) of our Fun-ASR (7.7B) and Fun-ASR-nano (0.8B) against top-tier ASR and speech-text multimodal models.
  • Figure 2: Overview of the Fun-ASR model architecture.
  • Figure 3: The pre-training pipeline for the audio encoder.
  • Figure 4: Overview and time-consumption analysis of our FunRL framework.