SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Xinlei Niu; Jing Zhang; Christian Walder; Charles Patrick Martin

TL;DR

SoundLoCD tackles text-to-sound generation under limited compute by integrating a frozen DiffSound backbone with LoRA adapters and a conditional discrete contrastive diffusion objective. The method encodes audio via a pre-trained spectrogram VQ-VAE and conditions on text features, while a CDCD loss pulls the correct text–sound pair closer and pushes negatives apart across $N$ negative samples, guided by a discrete forward diffusion on the latent codes. Empirical results on AudioCaps and ESC50 show SoundLoCD achieves better fidelity, diversity, and text correspondence with only $\sim$2.38M trainable parameters, outperforming the DiffSound baseline and demonstrating robustness to text encoder choices. The approach offers practical advantages for low-resource T2S applications and establishes a pathway for efficient fine-tuning of large diffusion models through LoRA and contrastive objectives.
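To make the parameter-efficiency claim concrete, here is a minimal sketch of a generic LoRA-adapted linear layer: the pre-trained weight stays frozen and only a low-rank update is trained. This illustrates the general LoRA idea and is not the authors' implementation. The class name `LoRALinear`, the layer size (64 by 64), the rank `r=4`, and the scaling `alpha` are all hypothetical values chosen for illustration; this summary does not state where the adapters sit inside DiffSound or which rank is used.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration of generic LoRA, not the SoundLoCD code.
    """

    def __init__(self, d_in, d_out, r, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for a frozen pre-trained weight (random here, for illustration).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is small and random, B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank trainable path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        r, d_in, d_out = len(self.A), len(self.A[0]), len(self.B)
        return r * (d_in + d_out)

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


# Assumed sizes, chosen only to show the ratio of trainable to frozen weights.
layer = LoRALinear(d_in=64, d_out=64, r=4)
x = [1.0] * 64
print(layer.trainable_params(), layer.frozen_params())
```

With these assumed sizes the adapter trains `r * (d_in + d_out)` weights while `d_in * d_out` stay frozen, which is the mechanism that lets the trainable parameter count stay small relative to the backbone.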

Abstract

We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: https://XinleiNIU.github.io/demo-SoundLoCD/

Paper Structure (9 sections, 10 equations, 2 figures, 4 tables)

Introduction
Methodology
SoundLoCD
LoRA
Conditional Discrete Contrastive Diffusion
Experiments and Results
Results and Analysis
Ablation Study
Conclusion

Figures (2)

Figure 1: Overall pipeline of SoundLoCD, which builds on a pre-trained spectrogram VQ-VAE. SoundLoCD runs $N+1$ parallel discrete diffusion processes: one on the original data and $N$ on randomly shuffled negative data.
Figure 2: Visualization of samples generated by DiffSound and SoundLoCD, compared with the ground truth.
