SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models
Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin
TL;DR
SingOMD addresses the domain gap between speech SSL representations and singing by resynthesizing singing from the features of a frozen speech SSL model and then applying a U-Net-based multi-resolution resampling module. The resulting continuous features are discretized via K-means to produce singing-oriented tokens, which are used for both singing resynthesis and singing voice synthesis (SVS), achieving performance competitive with mel-spectrogram baselines while improving efficiency. The approach performs robustly across resynthesis and SVS tasks and shows that speech SSL models can be leveraged for high-quality singing generation with discrete representations. It offers a practical pathway to discrete SVS without modifying SSL backbones, enabling better cross-domain reuse of existing SSL models.
Abstract
Discrete representations have shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, directly applying speech SSL models to singing generation runs into the domain gap between speech and singing. Furthermore, singing generation requires a more refined representation than typical speech generation. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL models through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.
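To make the last two steps concrete, the following is a minimal sketch of resampling frame-level features to several temporal resolutions and discretizing each resolution by clustering. It is an illustration under stated assumptions, not the SingOMD implementation: random vectors stand in for adapted SSL features, fixed average pooling stands in for the learned resampling modules, and the function names, pooling factors (1, 2, 4), and codebook size (8) are all hypothetical.

```python
import numpy as np


def resample_frames(feats, factor):
    """Downsample (T, D) features along time by averaging groups of `factor` frames.

    Assumption: plain average pooling replaces the learned resampling module.
    """
    num_frames, dim = feats.shape
    kept = num_frames // factor
    return feats[: kept * factor].reshape(kept, factor, dim).mean(axis=1)


def assign_tokens(feats, centroids):
    """Map each frame to the index of its nearest centroid (its discrete token)."""
    dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans(feats, k, iters=20, seed=0):
    """Basic Lloyd's K-means; returns a (k, D) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        labels = assign_tokens(feats, centroids)
        for j in range(k):
            members = feats[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids


# Stand-in for adapted SSL hidden features: 200 frames, 16 dimensions (assumption).
features = np.random.default_rng(0).normal(size=(200, 16))

# One codebook and one token sequence per temporal resolution.
tokens = {}
for factor in (1, 2, 4):
    resampled = resample_frames(features, factor)
    codebook = kmeans(resampled, k=8)
    tokens[factor] = assign_tokens(resampled, codebook)
```

Each entry of `tokens` is an integer sequence whose length shrinks with the pooling factor, so coarser resolutions yield shorter token streams for the same audio.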
