MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Junhao Xu, Zhenlin Liang, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang

TL;DR

MooER introduces an LLM-based framework for end-to-end ASR and AST trained on a compact 5k-hour, pseudo-labeled dataset, achieving competitive Mandarin and English speech recognition and strong translation performance while running on domestic GPU hardware. The architecture combines a Paraformer encoder, a fusion-oriented adapter, and a Qwen2-7B-instruct LLM, with only 2% of LLM parameters updated via LoRA, enabled by DeepSpeed-based optimizations. Key contributions include a practical training strategy that leverages pseudo labels, a demonstrated BLEU of 25.2 on Covost2 Zh2en, and CER/WER improvements that scale with larger internal data (80k hours). The work emphasizes industrial applicability, rapid domain adaptation, and plans to open-source both models and training methodology to advance speech-language-model systems in resource-constrained settings.
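As a rough illustration of why LoRA updates only a small share of the parameters, the sketch below counts the extra weights LoRA adds to a set of adapted matrices. The layer shapes and rank are hypothetical placeholders, not the MooER or Qwen2 configuration, and the ratio is taken over the adapted matrices only, not the whole LLM.

```python
def lora_trainable_fraction(layer_shapes, rank):
    """Ratio of LoRA adapter parameters to the frozen weights they adapt.

    layer_shapes: list of (d_out, d_in) pairs, one per adapted weight matrix
                  (hypothetical values; not taken from the paper).
    rank: LoRA rank r. Each adapted matrix gains two low-rank factors,
          B of shape (d_out, r) and A of shape (r, d_in).
    """
    base = sum(d_out * d_in for d_out, d_in in layer_shapes)
    added = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return added / base


# Hypothetical example: four square 4096 x 4096 projections adapted at rank 64.
fraction = lora_trainable_fraction([(4096, 4096)] * 4, rank=64)
print(f"{fraction:.2%}")
```

Measured against a full model the share would be smaller still, since embeddings and any non-adapted layers add to the denominator but not to the trainable count.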

Abstract

In this paper, we present MooER, an LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000-hour pseudo-labeled dataset containing open-source and self-collected speech data is used for training. We achieve performance comparable to other open-source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments conducted on the Covost2 Zh2en test set suggest that our model outperforms other open-source speech LLMs, obtaining a BLEU score of 25.2. The main contributions of this paper are summarized as follows. First, this paper presents a training strategy for encoders and LLMs on speech-related tasks (including ASR and AST) using a small amount of pseudo-labeled data without any extra manual annotation or selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, a model trained on 80,000 hours of training data is planned for release later.

Paper Structure (16 sections, 1 figure, 9 tables)