Music Style Transfer with Time-Varying Inversion of Diffusion Models

Sifei Li; Yuxin Zhang; Fan Tang; Chongyang Ma; Weiming Dong; Changsheng Xu

TL;DR

This work tackles text-guided music style transfer under data scarcity. Built on a diffusion backbone (Riffusion), it introduces a time-varying textual inversion module whose time-varying encoder (TVE) embeds a style audio clip as a pseudo-word. The focus of that embedding shifts from texture to structure as the diffusion timestep increases, which enables precise transfer from arbitrary examples while content guidance preserves the melody. A Bias-Reduced Stylization (BRS) strategy performs a partial diffusion up to $t_p = T \cdot \text{strength}$ and denoises with the predicted noise to stabilize content preservation, operating in the latent space of a VAE. The approach is evaluated on a small, diverse dataset and outperforms the baselines (R+TI, SS VQ-VAE, MUSICGEN) on both objective CLAP-based metrics and a multi-participant user study, showing improved content preservation and style fit. The method supports music style transfer from non-musical and natural sounds and points toward more interpretable, attribute-disentangled music stylization with stronger pretrained models.
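The partial-diffusion step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear beta schedule, the helper names (`partial_timestep`, `alpha_bar`, `add_noise`, `renoise_with_prediction`), and the `predict_noise` callable standing in for the diffusion model are all assumptions made for illustration. In particular, re-noising the content latent with the predicted noise is only one plausible reading of "denoises with the predicted noise".

```python
import numpy as np


def partial_timestep(T: int, strength: float) -> int:
    """t_p = T * strength, rounded and clipped to [0, T]."""
    return int(min(T, max(0, round(T * strength))))


def alpha_bar(T: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def add_noise(z0: np.ndarray, noise: np.ndarray, t: int, abar: np.ndarray) -> np.ndarray:
    """Standard DDPM forward step: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * noise."""
    if t == 0:
        return z0.copy()
    a = abar[t - 1]  # timesteps are 1-indexed here
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise


def renoise_with_prediction(z_content, t_p, abar, predict_noise, rng):
    """HYPOTHETICAL: noise the content latent, then re-noise it with the model's prediction."""
    eps = rng.standard_normal(z_content.shape)
    z_noisy = add_noise(z_content, eps, t_p, abar)     # noisy content latent
    eps_hat = predict_noise(z_noisy, t_p)              # stand-in for the denoiser
    return add_noise(z_content, eps_hat, t_p, abar)    # start point for denoising


rng = np.random.default_rng(0)
z_c = rng.standard_normal((4, 8, 8))                   # toy VAE latent of the content clip
abar = alpha_bar(T=1000)
t_p = partial_timestep(T=1000, strength=0.6)
dummy_model = lambda z, t: np.zeros_like(z)            # placeholder, not a real denoiser
z_start = renoise_with_prediction(z_c, t_p, abar, dummy_model, rng)
```

A smaller `strength` gives a smaller $t_p$, so less noise is injected and more of the content latent survives into the denoising stage.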

Abstract

With the development of diffusion models, text-guided image style transfer has demonstrated high-quality controllable synthesis results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and source code are available at https://lsfhuihuiff.github.io/MusicTI/.

Paper Structure (8 sections, 6 equations, 4 figures, 2 tables)

Music style transfer.
Text-to-music generation.
Textual inversion.
Dataset.
Implementation details.
User study.
Time-varying embedding (TVE).
Bias-reduced stylization.

Figures (4)

Figure 1: Music style transfer results using our method. Our approach can accurately transfer the style of various mel-spectrograms (e.g., instruments, natural sounds, synthetic sound) to content mel-spectrograms using minimal reference data, even as little as a five-second clip. In the style mel-spectrograms, the black box highlights the regions with prominent texture. It can be observed in the blue boxes that the style transfer results preserve a similar structure to the content mel-spectrograms while exhibiting similar texture to the style mel-spectrograms.
Figure 2: An overview of our method. We adopt Riffusion as the backbone network and propose a time-varying textual inversion module, which mainly consists of a time-varying encoder (TVE), shown on the right. TVE applies several linear layers to the timestep $t_e$, adds the output to the initial embedding $v_{o*}$, and passes the result through multiple attention modules to produce the final embedding $v_{i*}$. $M_s$, $\hat{M}_{s}$, $M_c$, $M_{cn}$, $\hat{z}_{t_p}$, $\hat{M}_{cn}$, $\hat{M}_{cs}$ respectively denote the style mel-spectrogram, reconstructed style mel-spectrogram, content mel-spectrogram, noisy content mel-spectrogram, predicted noise, predicted noisy content mel-spectrogram, and stylized mel-spectrogram.
Figure 3: Our time-varying textual inversion module extends the time-step dimension of text embeddings. When reconstructing style mel-spectrograms, the text embeddings exhibit differentiation in the time-step dimension. As the time steps increase, the focus of the text embeddings shifts from texture to structure.
Figure 4: Qualitative comparison with state-of-the-art methods (Gal et al., 2022; Cífka et al., 2021; Copet et al., 2023). (a) Style mel-spectrograms; the texts on the left are the sound categories. (b) Mel-spectrograms. (c)-(d) The stylized results of various methods. In the style mel-spectrograms, the black box highlights the regions with prominent texture. It can be observed in the blue boxes that only our results preserve a similar structure to the content mel-spectrograms while exhibiting a similar texture to the style mel-spectrograms.
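The data flow that the Figure 2 caption describes for the time-varying encoder (linear layers on the timestep, addition to the initial embedding, then attention modules) can be sketched roughly as follows. This is a toy NumPy illustration, not the authors' architecture: the sinusoidal timestep features, the `tanh` nonlinearity, the residual single-head self-attention, the embedding shape (`n_tokens`, `dim`), and the class name `TimeVaryingEncoder` are all assumptions. Weights are random and untrained, so the sketch shows only that the output embedding depends on the timestep, not the texture-to-structure shift reported in Figure 3.

```python
import numpy as np


def timestep_features(t: int, dim: int) -> np.ndarray:
    """Sinusoidal features of the timestep (an ASSUMED input encoding; dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class TimeVaryingEncoder:
    """Toy stand-in: timestep -> linear layers -> add to v_o -> attention -> v_i."""

    def __init__(self, dim: int = 8, n_tokens: int = 4, n_attn: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        # Initial embedding v_o (learnable in a real model; random here).
        self.v_o = rng.standard_normal((n_tokens, dim)) * scale
        # Two linear layers applied to the timestep features.
        self.w1 = rng.standard_normal((dim, dim)) * scale
        self.w2 = rng.standard_normal((dim, dim)) * scale
        # One (query, key, value) projection triple per attention module.
        self.attn = [
            tuple(rng.standard_normal((dim, dim)) * scale for _ in range(3))
            for _ in range(n_attn)
        ]

    def __call__(self, t: int) -> np.ndarray:
        h = np.tanh(timestep_features(t, self.dim) @ self.w1) @ self.w2
        x = self.v_o + h  # broadcast the timestep signal over all tokens
        for wq, wk, wv in self.attn:
            q, k, v = x @ wq, x @ wk, x @ wv
            weights = softmax(q @ k.T / np.sqrt(self.dim))
            x = x + weights @ v  # residual self-attention
        return x  # final embedding v_i for this timestep


encoder = TimeVaryingEncoder()
v_early = encoder(10)    # embedding used at a small timestep
v_late = encoder(900)    # embedding used at a large timestep
```

In a trained model, only the encoder parameters would be optimized against the style mel-spectrogram while the diffusion backbone stays frozen, as is usual for textual inversion; that training loop is outside the scope of this sketch.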
