MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Qian Chen; Yafeng Chen; Yanni Chen; Mengzhe Chen; Yingda Chen; Chong Deng; Zhihao Du; Ruize Gao; Changfeng Gao; Zhifu Gao; Yabin Li; Xiang Lv; Jiaqing Liu; Haoneng Luo; Bin Ma; Chongjia Ni; Xian Shi; Jialong Tang; Hui Wang; Hao Wang; Wen Wang; Yuxuan Wang; Yunlan Xu; Fan Yu; Zhijie Yan; Yexin Yang; Baosong Yang; Xian Yang; Guanrou Yang; Tianyu Zhao; Qinglin Zhang; Shiliang Zhang; Nan Zhao; Pei Zhang; Chong Zhang; Jinren Zhou

TL;DR

This work introduces MinMo, an aligned multimodal LLM with approximately 8B parameters designed for seamless voice interaction. It addresses the main limitations of prior aligned models through four stages of speech-focused alignment (speech-to-text, text-to-speech, speech-to-speech, and duplex interaction), trained on 1.4 million hours of diverse speech data. It features a streaming voice decoder and a full-duplex mechanism, enabling real-time, end-to-end voice conversations while preserving core LLM capabilities, and it achieves state-of-the-art results across ASR, speech translation, emotion recognition, and speaker analysis. Latency is approximately 100 ms for speech-to-text and, for full duplex, approximately 600 ms in theory and 800 ms in practice. The model supports instruction-following voice generation with controllable emotions, dialects, and speaking styles. Limitations include instruction-following updates to the LLM that are constrained by the use of LoRA, long-tail pronunciation issues, and the need for a fully end-to-end duplex system with integrated AEC/VAD in future work.

Abstract

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100 ms, and the full-duplex latency is approximately 600 ms in theory and 800 ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.

Paper Structure (36 sections, 1 equation, 4 figures, 23 tables)

Introduction
Related Work
Multimodal Spoken Dialogue Models
Text Style-Controllable Speech Synthesis
MinMo
Model Architecture
Streaming Voice Decoder
Tasks and Training Data
Model Training
Experiments
Speech Recognition and Translation
Multilingual Speech Recognition
Multilingual Speech Translation
Language Identification
Contextual Biasing Speech Recognition
...and 21 more sections

Figures (4)

Figure 1: Performance comparison between our MinMo ($\sim$8B parameters) and top-tier speech-text multimodal models, including Moshi (7B) \cite{DBLP:journals/corr/abs-2410-00037}, Freeze-Omni (7.5B) \cite{wang2024freeze}, GLM-4-Voice (9B) \cite{zeng2024glm}, SeamlessM4T Large v2 (2.3B) \cite{DBLP:journals/corr/abs-2308-11596}, NExT-GPT (12.42B) \cite{DBLP:conf/icml/Wu0Q0C24}, the speech-to-text model Qwen2-Audio ($\sim$8B) \cite{chu2024qwen2}, Whisper-large-v3 (1.55B) \cite{radford2023robust}, and others. We demonstrate the capabilities of MinMo on automatic speech recognition (ASR), speech-to-text translation (S2TT), spoken question answering (SQA), which encompasses both speech-to-text (S2T) and speech-to-speech (S2S), vocal sound classification (VSC), speech emotion recognition (SER), language identification (LID), age recognition, and gender detection. ASR is evaluated using 1-WER%, with Fleurs and Common Voice results averaged over 10 languages (zh, en, ja, ko, yue, de, fr, ru, es, it). S2TT is evaluated using BLEU, with CoVoST2 results averaged over the en2zh, en2ja, and zh/ja/de/fr/ru/es/it2en translation directions. SQA is evaluated using Accuracy. SER is evaluated using Weighted Accuracy. MinMo surpasses the previous SOTA models on all these tasks.
Figure 2: Examples demonstrating various capabilities of MinMo. More capabilities of MinMo include the tasks shown in Table \ref{tab:MinMo_data}.
Figure 3: The overall architecture of MinMo. Table \ref{tab:MinMo_module} provides detailed descriptions of each module in this diagram.
Figure 4: Detailed training data for the Speech-to-Text Alignment stage. Left: Data distribution for Full-Align training. Right: Data distribution for instruction fine-tuning (SFT).
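
The ASR scores in Figure 1 are reported as 1-WER%, so higher is better. As a minimal sketch of how such a score can be computed, the Python below derives word error rate from a word-level edit distance and converts it to 1-WER%. The function names and the whitespace tokenization are assumptions made for illustration; the text normalization and tokenization actually used for the reported results are not stated in this section, and languages written without spaces are commonly scored at the character level instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Tokenization by whitespace is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def one_minus_wer_percent(reference: str, hypothesis: str) -> float:
    """Convert WER to the 1-WER% form used on the Figure 1 ASR axes."""
    return (1.0 - word_error_rate(reference, hypothesis)) * 100.0


if __name__ == "__main__":
    score = one_minus_wer_percent("a b c d", "a x c d")
    print(f"1-WER% = {score:.1f}")
```

Because insertions count as errors, WER can exceed 1, in which case 1-WER% becomes negative; a benchmark average over several languages would simply average the per-language scores.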
