Tutti: Expressive Multi-Singer Synthesis via Structure-Level Timbre Control and Vocal Texture Modeling

Jiatao Chen; Xing Tang; Xiaoyue Duan; Yutang Feng; Jinchao Zhang; Jie Zhou

TL;DR

Tutti addresses dynamic multi-singer arrangement within a single song through structure-level timbre control and vocal texture modeling. It combines a structure-aware singer prompt, merged by an adaptive fuser, with a condition-guided VAE that captures implicit textures, and integrates both into a latent Diffusion Transformer (DiT) backbone to generate cohesive multi-singer vocal performances. Key contributions include the first multi-singer generation framework for structured singer scheduling, a texture-learning module that disentangles texture from explicit controls, and extensive evaluations showing improved intelligibility, timbre fusion, and choral realism. Quantitative metrics and qualitative analyses, including visualizations of chorus-like texture and timing behavior, support the approach, which points toward more expressive ensemble generation in practical multi-singer Singing Voice Synthesis (SVS).

Abstract

While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learning via Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti excels in precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.
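To make "singer scheduling evolving with musical structure" concrete, the following is a minimal sketch of how a segment-level schedule could be expanded into a per-frame singer activation table. It is an illustration only, not the authors' implementation: the `Segment` class, the `frame_mask` function, the singer IDs, and the frame rate are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One structural section of a song (hypothetical representation)."""
    label: str      # structure label, e.g. "verse" or "chorus"
    start: float    # start time in seconds
    end: float      # end time in seconds
    singers: tuple  # IDs of the singers active in this section


def frame_mask(segments, singer_ids, frame_rate=10.0):
    """Expand a segment-level schedule into a per-frame 0/1 activation table.

    Returns a list of rows, one per frame; each row has one entry per singer.
    """
    total = max(seg.end for seg in segments)
    n_frames = int(round(total * frame_rate))
    column = {sid: i for i, sid in enumerate(singer_ids)}
    mask = [[0] * len(singer_ids) for _ in range(n_frames)]
    for seg in segments:
        first = int(round(seg.start * frame_rate))
        last = int(round(seg.end * frame_rate))
        for t in range(first, min(last, n_frames)):
            for sid in seg.singers:
                mask[t][column[sid]] = 1
    return mask


# Assumed toy schedule: singer A sings the verse alone, A and B share the chorus.
schedule = [
    Segment("verse", 0.0, 2.0, ("A",)),
    Segment("chorus", 2.0, 3.0, ("A", "B")),
]
mask = frame_mask(schedule, ["A", "B"], frame_rate=2.0)
```

In this toy setting a solo section activates a single column and a chorus section activates several at once, which is the kind of structure-level control that a single global timbre embedding cannot express.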

Paper Structure (25 sections, 5 equations, 6 figures, 4 tables)

Introduction
Related Work
Singing Voice Synthesis
Multi-Talker Conversational Generation
Methodology
Overview
Structure-Aware Singer Prompt
Complementary Texture Learning via Condition-Guided VAE
Experiment
Experimental Setup
Main Results
Limitations and Future Work
Conclusion
Implementation and Training Details
Model Configuration
...and 10 more sections

Figures (6)

Figure 1: The overview of the Tutti framework for structure-aware multi-singer generation. The workflow begins by constructing structure-aware singer prompts and extracting complementary texture features from reference audio. These conditions, along with lyrics and structure labels, are fed into the DiT backbone to generate target vocal latents. Finally, the Vocal VAE decoder reconstructs the latents into high-fidelity multi-singer waveforms.
Figure 2: The architecture of the DiT-based backbone. The left panel illustrates the condition processing modules: the structure-aware singer prompt (denoted in orange) utilizes an adaptive fuser to integrate multi-singer information, while the texture embedding (denoted in purple) is extracted from reference audio to provide complementary acoustic features. These embeddings are concatenated with other conditions and input into the DiT blocks. Finally, the predicted latents are decoded by the Vocal VAE decoder to generate the waveform.
Figure 3: The architecture of Condition-Guided VAE for complementary texture learning.
Figure 4: Acoustic visualization analysis of generated multi-singer audio. (a) The Pitch Salience Map depicts distinct melodic patterns, showing single trajectories for solos and interwoven lines for the chorus. (b) The Mel Spectrogram illustrates the energy distribution in the time-frequency domain, highlighting spectral coherence and rich resonance structures.
Figure 5: Visualization of attention weights in the Adaptive Singer Prompt Fuser. (a) The Singer Attention Matrix shows the weight distribution of candidate singers across different structural segments. (b) The Singer-to-Singer Cross-Attention Matrix details the internal interaction weights within the multi-singer Chorus 1 segment.
...and 1 more figure
