Table of Contents
Fetching ...

Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages

Yangyang Meng, Jinpeng Li, Guodong Lin, Yu Pu, Guanbo Wang, Hu Du, Zhiming Shao, Yukai Huang, Ke Li, Wei-Qiang Zhang

TL;DR

Dolphin is a large-scale multilingual ASR system designed to improve recognition for 40 Eastern languages and 22 Chinese dialects by extending the Whisper framework. It combines proprietary Dataocean AI data with open-source corpora in a joint CTC-Attention architecture using an E-Branchformer encoder and region-aware two-level language tokens to capture dialectal variation. The study demonstrates substantial WER improvements over Whisper across multiple datasets and model sizes, supported by data-processing enhancements and memory-management techniques that enable efficient training on large-scale data. By releasing Dolphin base and small models with inference code, the work promotes reproducibility and community-driven progress in Eastern-language ASR.

Abstract

This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. Our approach integrates in-house proprietary and open-source datasets to refine and optimize Dolphin's performance. The model is specifically designed to achieve notable recognition accuracy for 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. Experimental evaluations show that Dolphin significantly outperforms current state-of-the-art open-source models across various languages. To promote reproducibility and community-driven innovation, we are making our trained models and inference source code publicly available.

Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages

TL;DR

Dolphin is a large-scale multilingual ASR system designed to improve recognition for 40 Eastern languages and 22 Chinese dialects by extending the Whisper framework. It combines proprietary Dataocean AI data with open-source corpora in a joint CTC-Attention architecture using an E-Branchformer encoder and region-aware two-level language tokens to capture dialectal variation. The study demonstrates substantial WER improvements over Whisper across multiple datasets and model sizes, supported by data-processing enhancements and memory-management techniques that enable efficient training on large-scale data. By releasing Dolphin base and small models with inference code, the work promotes reproducibility and community-driven progress in Eastern-language ASR.

Abstract

This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. Our approach integrates in-house proprietary and open-source datasets to refine and optimize Dolphin's performance. The model is specifically designed to achieve notable recognition accuracy for 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. Experimental evaluations show that Dolphin significantly outperforms current state-of-the-art open-source models across various languages. To promote reproducibility and community-driven innovation, we are making our trained models and inference source code publicly available.

Paper Structure

This paper contains 21 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Multitask format used by Dolphin, which mostly follows OpenAI Whisperradford2023robust. Dolphin focuses on ASR and does not support translation task. In addition, Dolphin introduces region-specific tokens, thus enabling support for dialects.
  • Figure 2: The distribution of data duration across 40 Eastern languages in the cleaned dataset, represented on a logarithmic scale. There are 36 languages with a data duration greater than 100 hours, and 16 languages with a data duration exceeding 1000 hours.
  • Figure 3: Data loading strategy optimization. Assume a node with 4 GPUs, each GPU is assigned a corresponding process, referred to as a rank. Before optimization, each rank loads a complete copy of the dataset, denoted as {D0,D1,D2,D3}. After optimization, each rank is assigned only the subset of the dataset it requires for computation.