Music Style Transfer with Time-Varying Inversion of Diffusion Models
Sifei Li, Yuxin Zhang, Fan Tang, Chongyang Ma, Weiming Dong, Changsheng Xu
TL;DR
This work tackles text-guided music style transfer under data scarcity. It introduces Time-Varying Textual Inversion (TVE), built on a diffusion-based backbone (Riffusion), which embeds a style audio clip as a pseudo-word whose embedding shifts from texture to structure across diffusion timesteps. This enables precise transfer from arbitrary examples, while content guidance preserves the melody. A Bias-Reduced Stylization (BRS) strategy operates in the latent space of a VAE: it performs a partial diffusion up to $t_p = T \cdot \text{strength}$ and then denoises with the predicted noise, which stabilizes content preservation. The approach is evaluated on a small, diverse dataset and outperforms baselines (R+TI, SS VQ-VAE, MUSICGEN) on both objective CLAP-based metrics and a multi-participant user study, with improved content preservation and style fit. The method enables robust music style transfer from non-musical and natural sounds, and points toward more interpretable, attribute-disentangled music stylization with stronger pretrained models.
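To make the partial-diffusion step concrete, the following is a minimal sketch of how a stopping timestep $t_p = T \cdot \text{strength}$ could be computed and how a content latent could be noised up to that step. It is not the authors' implementation: the function names, the linear beta schedule, and the standard DDPM closed-form noising are all assumptions chosen for illustration, and the `noise` argument is a stand-in for whichever noise source (random or model-predicted) the actual method uses.

```python
import numpy as np


def partial_diffusion_step(T: int, strength: float) -> int:
    """Return t_p = T * strength, the step at which forward diffusion stops."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    return int(round(T * strength))


def linear_alpha_bar(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def noise_latent(z0: np.ndarray, t_p: int, alpha_bar: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Noise a latent z0 up to step t_p with the standard DDPM closed form.

    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise
    """
    if t_p == 0:
        # strength = 0 leaves the content latent untouched
        return z0.copy()
    a = alpha_bar[t_p - 1]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise


# Hypothetical usage: a 4x8x8 latent, T = 1000 steps, strength = 0.5
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
T = 1000
t_p = partial_diffusion_step(T, 0.5)
alpha_bar = linear_alpha_bar(T)
z_tp = noise_latent(z0, t_p, alpha_bar, rng.standard_normal(z0.shape))
```

Under these assumptions, a smaller `strength` keeps more of the content latent (hence more of the melody), while a larger `strength` leaves more room for the style to reshape the result during denoising.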
Abstract
With the development of diffusion models, text-guided image style transfer has demonstrated high-quality controllable synthesis results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and source code are available at https://lsfhuihuiff.github.io/MusicTI/.
