Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, Zhizheng Wu

TL;DR

Metis tackles the challenge of unifying speech generation tasks by pre-training a foundation model on large-scale unlabeled speech using masked generative modeling and then fine-tuning with task-specific conditions. It builds a two-stage framework that first generates SSL tokens and then reconstructs acoustic tokens via a unified masked acoustic decoder, enabling efficient adaptation to zero-shot TTS, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech with multimodal inputs. The approach achieves state-of-the-art or competitive results across five tasks while using far fewer trainable parameters and substantially less task-specific data than baselines, and it demonstrates strong potential for multi-task learning through Metis-Omni. This work advances practical, data-efficient, and adaptable speech generation, with implications for scalable deployment and broader multimodal applications, while highlighting the need for safeguards against misuse of powerful voice-generation models.

Abstract

We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at https://metis-demo.github.io/.
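To make the idea of masked generative pre-training on discrete tokens more concrete, the following is a minimal sketch of the corruption step only: a random subset of token positions is replaced by a mask symbol, and those positions are the ones a model would be trained to predict. Everything here is an illustrative assumption rather than the authors' implementation: the `mask_tokens` helper, the `MASK_ID` sentinel, the fixed mask ratio, and the made-up token ids are all hypothetical, and the abstract does not specify the masking schedule Metis uses.

```python
import random

# Hypothetical sentinel for a masked position. A real model would more
# likely reserve an extra index in the token vocabulary.
MASK_ID = -1


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the corrupted sequence and the sorted list of masked positions.
    At least one position is masked for any non-empty input, so there is
    always something to predict.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    if not tokens:
        return [], []
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


rng = random.Random(0)  # seeded so the sketch is reproducible
ssl_tokens = [17, 4, 256, 91, 33, 8, 120, 5]  # made-up token ids
masked, positions = mask_tokens(ssl_tokens, mask_ratio=0.5, rng=rng)

# Training targets are the original tokens at the masked positions.
targets = [ssl_tokens[p] for p in positions]
```

In an actual training loop, the mask ratio would typically be sampled per example rather than fixed, and the loss would be computed only on the masked positions. Both of those choices are assumptions about common practice, not details taken from the abstract.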

Paper Structure

This paper contains 84 sections, 1 equation, 2 figures, 11 tables.

Figures (2)

  • Figure 1: An illustration of Metis. (a) provides an overview of the two-stage speech generation framework, which consists of task-specific (yellow block) and task-independent (light blue block) processes. In this work, we focus on developing a pre-training model for the first stage, as illustrated in (b). (c) demonstrates the fine-tuning process, where the pre-trained model is adapted to specific tasks.
  • Figure 2: Two discrete speech representations for the two-stage speech generation: SSL tokens (left) and acoustic tokens (right).