SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Xinlei Niu, Jing Zhang, Christian Walder, Charles Patrick Martin

TL;DR

SoundLoCD tackles text-to-sound (T2S) generation under limited compute by combining a frozen DiffSound backbone with LoRA adapters and a conditional discrete contrastive diffusion (CDCD) objective. The method encodes audio with a pre-trained spectrogram VQ-VAE and conditions on text features; the CDCD loss pulls each matched text–sound pair together and pushes $N$ negative samples away, guided by a discrete forward diffusion on the latent codes. On AudioCaps and ESC50, SoundLoCD achieves better fidelity, diversity, and text correspondence than the DiffSound baseline with only $\sim$2.38M trainable parameters, and it is robust to the choice of text encoder. The approach is practical for low-resource T2S applications and points to a path for efficient fine-tuning of large diffusion models through LoRA and contrastive objectives.

Abstract

We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: https://XinleiNIU.github.io/demo-SoundLoCD/
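
To make the efficiency argument concrete, the following is a minimal sketch of how a LoRA adapter adds a small trainable low-rank update to a frozen weight matrix. It is not the authors' implementation: the layer sizes, the rank, and the function names `lora_param_counts` and `lora_forward` are assumptions chosen purely for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (frozen, trainable) parameter counts for one LoRA-adapted linear layer."""
    frozen = d_out * d_in              # original weight W, kept frozen
    trainable = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return frozen, trainable


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x) using plain Python lists."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)               # output of the frozen layer
    update = matvec(B, matvec(A, x))  # low-rank correction, the only trained part
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical layer size and rank, for illustration only.
frozen, trainable = lora_param_counts(d_out=512, d_in=512, rank=8)
print(f"frozen={frozen} trainable={trainable} ratio={trainable / frozen:.4f}")
```

With these assumed sizes, the adapter trains 8,192 parameters against 262,144 frozen ones per layer, which is the kind of reduction that lets a large backbone be fine-tuned on modest hardware.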

Paper Structure (9 sections, 10 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overall pipeline of SoundLoCD, which builds on a pre-trained spectrogram VQ-VAE. SoundLoCD runs $N+1$ parallel discrete diffusion processes: one on the original data and $N$ on randomly shuffled negative data.
  • Figure 2: Visualization of samples generated by DiffSound and SoundLoCD, compared with the ground truth.
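
The negative sampling described in the Figure 1 caption can be illustrated with a generic contrastive sketch. This is an assumption-laden illustration, not the paper's CDCD loss: it uses a standard InfoNCE-style objective over scalar scores, and the helper names `shuffled_negatives` and `info_nce` are hypothetical. The actual CDCD objective is defined over the discrete diffusion processes and is not reproduced here.

```python
import math
import random


def shuffled_negatives(items, n_neg, seed=0):
    """For each position i, draw n_neg items from the other positions as negatives."""
    rng = random.Random(seed)
    negatives = []
    for i in range(len(items)):
        others = [x for j, x in enumerate(items) if j != i]
        negatives.append(rng.sample(others, n_neg))
    return negatives


def info_nce(pos_score, neg_scores):
    """InfoNCE-style loss: negative log-softmax of the positive score over all scores."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - pos_score


# Hypothetical batch of captions; each one gets N = 2 mismatched captions as negatives.
captions = ["dog barking", "rain falling", "car engine", "door closing"]
for caption, negs in zip(captions, shuffled_negatives(captions, n_neg=2)):
    print(caption, "->", negs)

# A higher score for the matched pair relative to the negatives lowers the loss.
print(info_nce(5.0, [0.0, 0.0]) < info_nce(0.0, [0.0, 0.0]))
```

The loss shrinks as the matched pair scores higher than its negatives, which is the pull-together and push-apart behaviour the TL;DR attributes to the contrastive objective.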