SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion

Liumeng Xue; Chaoren Wang; Mingxuan Wang; Xueyao Zhang; Jun Han; Zhizheng Wu

TL;DR

SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre.

Abstract

In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the diffusion generation process and resulting conversions. Through comparative and comprehensive evaluations, SingVisio demonstrates its effectiveness in terms of system design, functionality, explainability, and user-friendliness. It offers users of various backgrounds valuable learning experiences and insights into the diffusion model for singing voice conversion.
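The step-by-step denoising that SingVisio visualizes can be illustrated with a toy sketch. The code below is not the authors' model. It assumes the standard DDPM ancestral sampling update with a linear noise schedule, and it replaces the trained conditional network with a hypothetical oracle, `toy_noise_predictor`, that already knows the clean target spectrogram. All function names, the schedule values, and the array shapes are illustrative assumptions. The point is only to show how a noisy array is refined over diffusion steps and how the intermediate states, the kind of data a step-wise view would display, can be collected.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used default)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, alpha_bars, target):
    """Hypothetical oracle standing in for a trained conditional network.

    It solves x_t = sqrt(alpha_bar_t) * target + sqrt(1 - alpha_bar_t) * eps
    for eps, so it 'predicts' exactly the noise that separates x_t from target.
    """
    ab = alpha_bars[t]
    return (x_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)


def reverse_diffusion(target, num_steps=50, seed=0):
    """Run the reverse process and keep a snapshot after every step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    x = rng.standard_normal(target.shape)  # start from pure Gaussian noise
    snapshots = [x.copy()]

    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, t, alpha_bars, target)
        # Posterior mean of the DDPM ancestral sampling update.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one.
        noise = rng.standard_normal(target.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
        snapshots.append(x.copy())

    return snapshots


if __name__ == "__main__":
    # Toy stand-in for a clean Mel spectrogram: 8 Mel bins by 16 frames.
    target = np.outer(np.linspace(0.0, 1.0, 8), np.ones(16))
    snapshots = reverse_diffusion(target, num_steps=50, seed=0)
    for step in (0, 25, 50):
        err = np.abs(snapshots[step] - target).mean()
        print(f"after {step} denoising steps: mean abs error = {err:.4f}")
```

In a real system the oracle would be a neural network conditioned on source content, melody, and target timbre, and each snapshot would be rendered as a Mel spectrogram rather than summarized by a single error value.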

Paper Structure (29 sections, 2 equations, 14 figures, 2 tables)

Introduction
Related Work
Singing Voice Conversion
Visual Analysis for Explainable AI
Background: Diffusion-based Singing Voice Conversion
Architecture and Workflow
Implementation Details and Evaluation Metrics
Design Requirements
Requirement Analysis
Analytical Tasks
Explainer System
Control Panel
Step View
Comparison View
Projection View
...and 14 more sections

Figures (14)

Figure 1: The classic pipeline of an SVC system, comprising three steps: (a) feature extraction, which extracts content and melody features from the source and singer timbre from the target; (b) an acoustic model that maps the extracted features to acoustic features (e.g., a Mel spectrogram); and (c) a waveform synthesizer that reconstructs the singing voice from the converted acoustic features. In this study, "diffusion-based singing voice conversion" means that the acoustic model in the SVC system is a diffusion model.
Figure 2: Visual system for diffusion-based singing voice conversion. The system consists of five views. (A) Metric View shows objective evaluation results for the singing voice conversion model, allowing users to interactively explore the performance trend along diffusion steps. (B) Projection View helps users track the data patterns of diffusion steps in the embedding space under different input conditions. (C) Step View visualizes the Mel spectrogram and pitch contour at a single diffusion step. (D) Comparison View lets users compare voice conversion results across different diffusion steps or singers. (E) Control Panel enables users to select comparison modes and choose source and target singers to visually understand and analyze the model's behavior. The red annotations explain the patterns or components.
Figure 3: The left part is the Metric View with MCD metric selected. The right part is the corresponding "Metric Curve over Diffusion Step" for the best-performing sample on the MCD metric. The red annotation in the right part explains the tendencies of metric curves.
Figure 4: Accuracy on the objective questionnaires for the basic version, covering the tutorial group and the basic SingVisio group. The questions designed for the basic version relate to analysis tasks T1 and T2, as described in Section 4.2.
Figure 5: Accuracy on the objective questionnaires for the advanced version. The questions designed for the advanced version relate to all analysis tasks T1-T5, as described in Section 4.2.
...and 9 more figures
