Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung

TL;DR

Diffusion-Link tackles the persistent audio-text modality gap that hinders coupling multimodal encoders with large language models. It introduces a lightweight diffusion-based bridge, trained on paired audio-text embeddings, to map audio into the text-embedding distribution while preserving text-space geometry, and it uses a text-decoding LLM to generate captions. The approach yields the strongest modality alignment among diffusion-based methods and achieves state-of-the-art AudioCaps performance in both zero-shot and fully supervised settings without external knowledge, with relative gains up to $52.5\%$ CIDEr (zero-shot) and $7.5\%$ (supervised). This work demonstrates that reducing the modality gap can be a more effective driver of cross-modal performance than knowledge retrieval, and it presents a broadly applicable plug-in strategy for enhancing multimodal encoder–LLM coupling.

Abstract

Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap more than prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art results on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and that diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance at https://github.com/DevKiHyun/Diffusion-Link
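
The abstract describes the bridge as a lightweight network of three residual MLP blocks operating on encoder output embeddings. The following is only a rough NumPy sketch of what such a network could look like, not the authors' implementation: the embedding size, hidden width, ReLU activation, sinusoidal timestep embedding, and additive conditioning are all assumptions, and the names `init_block`, `timestep_embedding`, `residual_block`, and `bridge_network` are hypothetical.

```python
import numpy as np

def init_block(rng, dim, hidden):
    """Random weights for one residual MLP block (sizes are illustrative assumptions)."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, size=(hidden, dim)),
        "b2": np.zeros(dim),
    }

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps t with shape (batch,); dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def residual_block(x, params):
    """Compute x + MLP(x); the skip connection carries the input through unchanged."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU is an assumption
    return x + h @ params["w2"] + params["b2"]

def bridge_network(x_t, t, blocks):
    """Run a noisy embedding x_t through the blocks, conditioned on timestep t."""
    h = x_t + timestep_embedding(t, x_t.shape[1])  # additive conditioning is an assumption
    for params in blocks:
        h = residual_block(h, params)
    return h

rng = np.random.default_rng(0)
dim, hidden = 512, 1024                      # hypothetical sizes
blocks = [init_block(rng, dim, hidden) for _ in range(3)]  # three blocks, as in the abstract
audio_emb = rng.normal(size=(4, dim))        # stand-in for frozen-encoder audio embeddings
t = np.array([10, 50, 100, 500])             # one diffusion timestep per example
out = bridge_network(audio_emb, t, blocks)
print(out.shape)  # (4, 512)
```

Because each block is residual, the output keeps the dimensionality of the input embedding, which is what allows a module of this kind to sit between a frozen encoder and a downstream LLM without changing either interface.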

Paper Structure

This paper contains 14 sections, 10 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: (a) Overview of the proposed Diffusion-Link mechanism and (b,c) illustration of our LLM-based AAC system with Diffusion-Link.
  • Figure 2: UMAP visualization of embeddings on AudioCaps. Red lines connect paired audio and text embeddings. Green lines connect text-like embeddings to their original text embeddings.
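
The pairings drawn in Figure 2 can also be summarized numerically. The sketch below is a hedged illustration using two common proxies, mean paired cosine similarity and the distance between modality centroids; the paper's exact similarity and geometric criteria are not given in this excerpt, so these metrics are assumptions. The data is synthetic, and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def paired_cosine(a, b):
    """Mean cosine similarity between row-aligned pairs of embeddings."""
    return float(np.mean(np.sum(l2_normalize(a) * l2_normalize(b), axis=1)))

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two normalized embedding sets."""
    return float(np.linalg.norm(l2_normalize(a).mean(axis=0) - l2_normalize(b).mean(axis=0)))

# Synthetic stand-ins, not data from the paper.
rng = np.random.default_rng(0)
text = rng.normal(size=(100, 64))
offset = np.full(64, 2.0)                                   # constant shift mimicking a modality gap
audio = text + offset + 0.1 * rng.normal(size=(100, 64))    # "audio" side, offset from text
text_like = text + 0.1 * rng.normal(size=(100, 64))         # bridged embeddings near the text side

print("audio vs text:     ", paired_cosine(audio, text), centroid_distance(audio, text))
print("text-like vs text: ", paired_cosine(text_like, text), centroid_distance(text_like, text))
```

In this toy setup, a smaller gap shows up as a higher paired cosine similarity and a shorter centroid distance, which corresponds to the green lines in the figure being shorter than the red ones.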