MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Zihao Wang; Haoxuan Liu; Jiaxing Yu; Tao Zhang; Yan Liu; Kejun Zhang

TL;DR

This work addresses the alignment of colloquial human descriptions with AI-generated songs. It introduces CaiMD, a Chinese colloquial music description dataset built on CaiMAP, and MuDiT/MuSiT, an end-to-end single-stage diffusion framework. The system uses MuChin cross-modal embeddings, a fine-tuned lyric LLM that produces structure-aware lyrics, and transformer-based diffusion (DiT/SiT) operating in a VAE latent space, with a VAE decoder and HiFi-GAN converting the latents to audio. Training combines supervised pretraining on large paired lyrics-audio data, unsupervised VAE/HiFi-GAN pretraining, and CaiMD-based fine-tuning, which improves alignment with colloquial inputs and musical structure. Empirical results show that MuSiT/MuDiT outperform open-source baselines at comparable data scales and parameter budgets, support SiT's advantage for time-series music generation, and underscore CaiMD's value for end-to-end, human-centered AI music generation.

Abstract

Amid the growing intersection of generative AI and human artistic processes, this study examines the critical yet less-explored problem of alignment in human-centric automatic song composition. We propose a novel task, Colloquial Description-to-Song Generation, which focuses on aligning generated content with colloquial human expressions. The task aims to bridge the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that satisfy human auditory expectations and align structurally with musical norms. Existing datasets are limited by their narrow descriptive scope, semantic gaps, and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets, which are either pre-set with expert annotations or auto-generated with inherent biases, CaiMD is better suited to our purpose of aligning AI-generated music with what a broad range of users want. Moreover, we propose a single-stage framework called MuDiT/MuSiT for effective human-machine alignment in song creation. The framework achieves cross-modal comprehension between colloquial language and auditory music perception, and ensures that generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components such as melody, harmony, rhythm, vocals, and instrumentation. This approach keeps all generated musical components sonically cohesive, so the output resonates better with human auditory expectations.

Paper Structure (20 sections, 3 equations, 5 figures, 1 table)

Introduction
Background
Text-to-Song Generation
Annotated Music Datasets
Automatic Annotation
Manual Annotation
Transformer-Based Diffusion Models
Methodology
Task Definition
Caichong Music Dataset Construction
CaiMAP: Caichong Multitask Annotation Platform
Annotation Process
MuDiT/MuSiT Framework
Framework Overview
Transforming Noise to VAE Space
...and 5 more sections

Figures (5)

Figure 1: Pipeline of data annotation and quality assurance. Each annotated item passes through 5 phases to ensure accuracy. The figure shows actual screenshots of the pages for each phase.
Figure 2: An overview of CaiMD. The Chinese Colloquial Descriptions consist of Description(A) and Common Description(P and A), annotated by amateur annotators. In addition, professional annotators label Description(P), Musical Sections, and Rhyming Structures of the lyrics. Machine-annotated information such as MIDI is also incorporated.
Figure 3: The overall architecture of MuDiT/MuSiT. The text description is transformed into vectors using MuChin, and these vectors are concatenated with noise to form the noised latent input for DiT/SiT. A fine-tuned LLM generates the structured lyrics. DiT/SiT then generates the entire song content as a VAE latent representation, encompassing melody, harmony, rhythm, vocals, instrumentation, and other musical components. Finally, the VAE decoder and HiFi-GAN decode the song content into Mel spectrograms and convert them into WAV audio files.
Figure 4: The complete diffusion process of DiT/SiT. The model starts at timestep t = N and repeatedly subtracts the predicted noise from the noisy sample until it reaches timestep t = 0. The result is the final song content, output as a VAE latent representation.
Figure 5: One timestep in the diffusion process of DiT/SiT. The noisy sample is random noise only at timestep t = N; at every later timestep it is the denoised sample from the previous step. Because song structures and lyrics vary in length, they can only be processed through cross-attention. Text descriptions, in contrast, are converted into fixed-length vectors via MuChin, so they can be concatenated with the noisy sample and processed through self-attention. The output is the predicted noise at the current timestep t.
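
The control flow described in the captions of Figures 4 and 5 can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the authors' implementation: `toy_predict_noise` is a hypothetical stand-in for the trained DiT/SiT network, the concatenation is assumed to happen along the sequence axis, all vectors are assumed to share one dimension, and the update rule is the plain "subtract the predicted noise" step from the caption rather than an actual DDPM or SiT sampler.

```python
import math
import random


def attend(queries, context):
    """Scaled dot-product attention; context rows act as both keys and values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * c[i] for w, c in zip(weights, context)) / total for i in range(dim)]
        )
    return outputs


def toy_predict_noise(tokens, lyric_context, num_steps):
    """Hypothetical stand-in for the DiT/SiT network (NOT a trained model).

    A real model would also condition on the timestep t; this toy ignores it.
    """
    mixed = attend(tokens, tokens)              # self-attention: description + sample
    lyric_info = attend(tokens, lyric_context)  # cross-attention: variable-length lyrics
    return [
        [(x - 0.5 * m - 0.5 * l) / num_steps for x, m, l in zip(tok, mix, lyr)]
        for tok, mix, lyr in zip(tokens, mixed, lyric_info)
    ]


def sample_latent(desc_vectors, lyric_context, latent_len, dim, num_steps, seed=0):
    """Run the loop from t = N down to t = 0 and return the final latent."""
    rng = random.Random(seed)
    # At t = N the noisy sample is pure random noise.
    latent = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(latent_len)]
    for t in range(num_steps, 0, -1):
        # Assumption: fixed-length description vectors are prepended as extra tokens.
        tokens = desc_vectors + latent
        predicted = toy_predict_noise(tokens, lyric_context, num_steps)
        # Keep only the noise predicted for the latent positions.
        noise = predicted[len(desc_vectors):]
        # Subtract the predicted noise to obtain the sample for the next step.
        latent = [[x - e for x, e in zip(row, eps)] for row, eps in zip(latent, noise)]
    return latent


# Illustrative inputs only: two description vectors and seven lyric tokens.
description = [[0.5, 0.1, -0.2, 0.3], [0.0, 0.4, 0.2, -0.1]]
lyrics = [[1.0, 0.0, 0.0, 0.0]] * 7
song_latent = sample_latent(description, lyrics, latent_len=5, dim=4, num_steps=10)
print(len(song_latent), len(song_latent[0]))
```

The lyric context can have any number of rows without changing the shape of the result, which is the reason the caption gives for routing lyrics and song structure through cross-attention. The description vectors have a fixed count, so they can simply be placed alongside the noisy sample before self-attention.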
