DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

Shentong Mo, Zehua Chen, Fan Bao, Jun Zhu

TL;DR

DiffGAP targets two weaknesses of cross-modal contrastive learning: limited bidirectional interaction between modalities and noise in each modality's embeddings. It inserts a lightweight diffusion-based generative module into the contrastive embedding space. A Contrastive Diffusion Module denoises embeddings conditioned on cross-modal cues, and a Bidirectional Split Training scheme balances the two learning directions. On AudioCaps and VGGSound, DiffGAP improves video-to-audio and text-to-audio generation as well as retrieval, surpassing prior methods. The module stays efficient because diffusion runs in a compact embedding space with targeted conditioning. The work shows the practical value of integrating diffusion processes into contrastive cross-modal frameworks for more accurate multimedia understanding and generation.

Abstract

Recent work in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), has significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noise present in each modality, which can substantially affect the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on the VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.
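
To make the idea of "denoising one modality's embedding conditioned on another" concrete, the following is a minimal sketch of a standard DDPM-style noise-prediction objective applied to embedding vectors. It is not the authors' implementation: the linear noise schedule, the two-layer `toy_denoiser`, the shared parameters across both directions, and every function name and hyperparameter are assumptions chosen for illustration, and the random arrays stand in for real CLAP or CAVP embeddings.

```python
import numpy as np

def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward diffusion: mix clean embeddings x0 with Gaussian noise at step t."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]  # per-example retention, shape (batch, 1)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def init_params(dim, hidden, rng):
    """Weights for the hypothetical two-layer denoiser."""
    return {
        "w1": rng.standard_normal((2 * dim + 1, hidden)) * 0.1,
        "w2": rng.standard_normal((hidden, dim)) * 0.1,
    }

def toy_denoiser(x_t, cond, t, params, num_steps):
    """Predict the added noise from the noisy embedding, the conditioning
    embedding of the other modality, and a scalar timestep feature."""
    t_feat = (t / num_steps)[:, None]
    h = np.concatenate([x_t, cond, t_feat], axis=1)
    return np.tanh(h @ params["w1"]) @ params["w2"]

def denoising_loss(target, cond, params, alpha_bar, rng):
    """Mean squared error between the true and predicted noise."""
    t = rng.integers(0, len(alpha_bar), size=target.shape[0])
    x_t, eps = add_noise(target, t, alpha_bar, rng)
    eps_hat = toy_denoiser(x_t, cond, t, params, len(alpha_bar))
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
dim = 8
text_emb = rng.standard_normal((4, dim))   # placeholder text embeddings
audio_emb = rng.standard_normal((4, dim))  # placeholder audio embeddings
alpha_bar = make_alpha_bar()
params = init_params(dim, hidden=16, rng=rng)

# Both directions: denoise text given audio, and audio given text.
loss_text_given_audio = denoising_loss(text_emb, audio_emb, params, alpha_bar, rng)
loss_audio_given_text = denoising_loss(audio_emb, text_emb, params, alpha_bar, rng)
```

Because the vectors being diffused are low-dimensional embeddings rather than spectrograms or video frames, a denoiser of this kind can be small, which is consistent with the abstract's description of the module as lightweight. How the two directional losses are scheduled or combined during training is not specified in the abstract, so the sketch simply computes them independently.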

Paper Structure

This paper contains 11 sections, 2 equations, 7 tables, 1 algorithm.