Tutti: Expressive Multi-Singer Synthesis via Structure-Level Timbre Control and Vocal Texture Modeling
Jiatao Chen, Xing Tang, Xiaoyue Duan, Yutang Feng, Jinchao Zhang, Jie Zhou
TL;DR
Tutti addresses dynamic multi-singer arrangement within a single song by introducing structure-level timbre control and vocal texture modeling. It combines a structure-aware singer prompt and an adaptive fuser with a condition-guided VAE that captures implicit vocal textures, all integrated into a Latent Diffusion Transformer backbone to generate cohesive multi-singer vocal performances. Key contributions are the first multi-singer generation framework supporting structured singer scheduling, a texture-learning module that disentangles texture from explicit controls, and extensive evaluations showing improved intelligibility, timbre fusion, and choral realism. Quantitative metrics and qualitative analyses, including visualizations of chorus-like texture and timing behavior, support the approach, which moves practical multi-singer singing voice synthesis (SVS) toward more expressive ensemble music generation.
Abstract
While existing Singing Voice Synthesis (SVS) systems achieve high-fidelity solo performances, they are constrained by global timbre control and fail to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt that enables flexible singer scheduling evolving with the musical structure, and propose Complementary Texture Learning via a Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti achieves precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.
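To make the contrast between global and structure-level timbre control concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a song is divided into sections (e.g., verse, chorus), each assigned a set of singers, and that each singer has a fixed timbre embedding. The `Section` type, the `build_singer_prompt` function, and the frame-level representation are hypothetical; the plain average is only a stand-in for the adaptive fuser, whose actual design is not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Section:
    """Hypothetical structural segment of a song."""
    name: str           # e.g. "verse", "chorus"
    start_frame: int    # inclusive
    end_frame: int      # exclusive
    singers: List[str]  # IDs of singers active in this section


def build_singer_prompt(
    sections: Sequence[Section],
    singer_embeddings: Dict[str, List[float]],
    num_frames: int,
) -> List[List[float]]:
    """Return one timbre vector per frame (a structure-level prompt).

    Frames inside a section receive the element-wise mean of that section's
    singer embeddings (a simple stand-in for an adaptive fuser); frames not
    covered by any section, or covered by a section with no singers, keep a
    zero vector.
    """
    dim = len(next(iter(singer_embeddings.values())))
    prompt = [[0.0] * dim for _ in range(num_frames)]
    for sec in sections:
        if not sec.singers:
            continue
        vecs = [singer_embeddings[s] for s in sec.singers]
        # Element-wise average across the active singers.
        fused = [sum(col) / len(vecs) for col in zip(*vecs)]
        # Clip the section to the valid frame range before writing.
        for t in range(max(0, sec.start_frame), min(num_frames, sec.end_frame)):
            prompt[t] = list(fused)
    return prompt


if __name__ == "__main__":
    embeddings = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
    sections = [
        Section("verse", 0, 2, ["A"]),        # solo verse by singer A
        Section("chorus", 2, 4, ["A", "B"]),  # A and B sing the chorus together
    ]
    for frame, vec in enumerate(build_singer_prompt(sections, embeddings, 5)):
        print(frame, vec)
```

Under a global timbre control scheme, every frame would carry the same vector; here the prompt changes at section boundaries, which is the property the structure-level design targets.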
