SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
Xinlei Niu, Jing Zhang, Christian Walder, Charles Patrick Martin
TL;DR
SoundLoCD tackles text-to-sound (T2S) generation under limited compute by adding LoRA adapters to a frozen DiffSound backbone and training them with a conditional discrete contrastive diffusion (CDCD) objective. Audio is encoded into discrete latent codes by a pre-trained spectrogram VQ-VAE, and generation is conditioned on text features. Under a discrete forward diffusion on the latent codes, the CDCD loss pulls the matching text–sound pair together and pushes the $N$ negative samples away. On AudioCaps and ESC50, SoundLoCD achieves better fidelity, diversity, and text correspondence than the DiffSound baseline with only $\sim$2.38M trainable parameters, and it remains robust to the choice of text encoder. The approach is practical for low-resource T2S applications and suggests that LoRA combined with a contrastive objective is an efficient way to fine-tune large diffusion models.
Abstract
We present SoundLoCD, a novel text-to-sound generation framework that incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be trained efficiently under limited computational resources. The integration of a contrastive learning strategy further strengthens the connection between text conditions and the generated outputs, yielding coherent, high-fidelity sound. Our experiments demonstrate that SoundLoCD outperforms the baseline while requiring far fewer computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: \url{https://XinleiNIU.github.io/demo-SoundLoCD/}.
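To illustrate why a LoRA-based model can be trained with few parameters, the following is a minimal sketch of a generic LoRA-augmented linear layer in pure Python. It is not the SoundLoCD implementation: the class name `LoRALinear`, the helper `matvec`, the layer sizes, the rank, and the initialization scales are all hypothetical choices made for illustration. The frozen weight `W` is left untouched, and only the low-rank factors `A` and `B` would be updated during fine-tuning.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Generic LoRA layer: y = W x + (alpha / rank) * B (A x).

    Hypothetical illustration only; sizes and init scales are assumptions.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen pre-trained weight (d_out x d_in); never updated.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable down-projection (rank x d_in), small random init.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Trainable up-projection (d_out x rank), zero init so the layer
        # initially reproduces the frozen layer exactly.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank update path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=8, d_out=6, rank=2)
print(layer.trainable_params(), layer.frozen_params())
```

For a layer with `d_in` inputs, `d_out` outputs, and rank `r`, the trainable count is `r * (d_in + d_out)` against `d_in * d_out` frozen weights, so the saving grows with layer size when `r` is small.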
