Robust Singing Voice Transcription Serves Synthesis

Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao

TL;DR

ROSVOT addresses the bottleneck of automatic, robust note-level singing transcription with a multi-scale architecture: a Conformer captures coarse-grained note semantics while a U-Net backbone provides fine-grained frame-level segmentation, guided by word boundaries and trained with noise-robust data augmentation. An attention-based pitch decoder is trained jointly with the boundary objective, yielding state-of-the-art accuracy in both clean and noisy conditions. The authors also establish a comprehensive SVS annotation-and-training pipeline, showing that SVS models trained on automatically annotated data can approach the performance of models trained on manual annotations, and that cross-lingual generalization is feasible with modest degradation. ROSVOT thus offers a practical path toward scalable SVS data collection and improved synthesis quality across languages and noise levels.

Abstract

Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also established a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming the capability for practical application. Audio samples are available at https://rosvot.github.io.
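
To make the term "note sequence" concrete, the following minimal sketch shows one simple way frame-level information could be collapsed into notes. It is an illustration only and not the ROSVOT model: the note boundaries are assumed to be given (ROSVOT predicts them), and each note's pitch is taken as the median of the voiced F0 frames rather than produced by an attention-based decoder. The names `hz_to_midi`, `frames_to_notes`, and `hop_seconds` are hypothetical.

```python
import math
from statistics import median


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def frames_to_notes(f0_hz, boundaries, hop_seconds):
    """Collapse a frame-level F0 contour into (onset_s, offset_s, midi_pitch) notes.

    f0_hz       -- per-frame F0 in Hz, with 0 marking unvoiced frames (assumed convention)
    boundaries  -- sorted frame indices at which a new segment starts (assumed given)
    hop_seconds -- duration of one frame in seconds
    """
    edges = [0] + list(boundaries) + [len(f0_hz)]
    notes = []
    for start, end in zip(edges[:-1], edges[1:]):
        voiced = [f for f in f0_hz[start:end] if f > 0]
        if not voiced:
            continue  # a segment with no voiced frames is treated as a rest
        # Median pitch is a simple stand-in for a learned pitch decoder.
        pitch = round(median(hz_to_midi(f) for f in voiced))
        notes.append((start * hop_seconds, end * hop_seconds, pitch))
    return notes


# Toy contour: four frames of A4, two unvoiced frames, four frames of C5.
toy_f0 = [440.0] * 4 + [0.0] * 2 + [523.25] * 4
toy_notes = frames_to_notes(toy_f0, boundaries=[4, 6], hop_seconds=0.01)
```

In this toy example the unvoiced segment is dropped as a rest, leaving two notes. A practical transcriber has to decide the boundaries itself and cope with noisy F0, which is the part the paper addresses.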

Paper Structure

This paper contains 38 sections, 6 equations, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: AST and ASR systems serve SVS.
  • Figure 2: The overall architecture. $\text{E}_{\text{M}}$, $\text{E}_{\text{B}}$, and $\text{E}_{\text{P}}$ denote the encoders for the Mel-spectrogram, word-boundary, and F0-contour inputs, respectively. $\text{D}_{\text{B}}$ and $\text{D}_{\text{P}}$ denote the decoders for note boundaries and pitches. "Down" and "Up" denote the encoder and decoder of the U-Net backbone. "Seg." and "Smooth" indicate the temporal segmentation and label smoothing operations. $\text{E}_{\text{W}}$ is an optional extractor that provides word boundaries.
  • Figure 3: Word-note synchronization.
  • Figure 4: N-layer residual convolution blocks.
  • Figure 5: Injection of self-supervised features.