LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung

TL;DR

LAVCap tackles automated audio captioning (AAC) by using visual context through a cross-modal alignment and fusion pipeline. It introduces an optimal transport (OT) loss $\mathcal{L}_{OT}$ based on an assignment map, together with an OT-Att fusion mechanism, to bridge the modality gap between audio and visual features; an LLM then decodes the caption conditioned on the fused representations. The system is trained with a combination of the autoregressive loss $\mathcal{L}_{CE}$ and the OT loss, using LoRA to adapt the audio encoder while keeping the visual encoder frozen and the LLM decoder partially fine-tuned, and achieves strong results on AudioCaps without large-scale pretraining. Experiments, including ablations and qualitative mean opinion score (MOS) studies, indicate that OT-based alignment is effective and data-efficient for audio-visual captioning, and that incorporating visual context has practical value in AAC.
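
The combined training objective described above can be written compactly as a weighted sum. The weighting coefficient $\lambda$ is an assumption introduced here for illustration; this summary states only that the two losses are combined, not how they are weighted.

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \, \mathcal{L}_{OT}$$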

Abstract

Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.
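
To make the two optimal transport components concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a cosine-distance cost between audio and visual token features, uniform marginals, entropic regularization solved with Sinkhorn iterations, and a residual fusion step that uses the row-normalized assignment map as attention weights. All function names, the value of `eps`, and the fusion rule are hypothetical choices made for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def cosine_cost(audio, visual):
    """Pairwise cosine distance between audio tokens (n, d) and visual tokens (m, d)."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    return 1.0 - a @ v.T

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropic OT with uniform marginals (assumed); returns an (n, m) assignment map."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # mass on each audio token
    nu = np.full(m, 1.0 / m)   # mass on each visual token
    K = np.exp(-cost / eps)    # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = mu / (K @ v)       # rescale rows toward marginal mu
        v = nu / (K.T @ u)     # rescale columns toward marginal nu
    return u[:, None] * K * v[None, :]

def ot_loss(plan, cost):
    """Transport cost <plan, cost>, used here as the alignment loss (assumed form)."""
    return float(np.sum(plan * cost))

def ot_fuse(audio, visual, plan):
    """Hypothetical OT-attention fusion: each audio token adds a plan-weighted
    average of the visual tokens (row-normalized assignment map as weights)."""
    weights = plan / plan.sum(axis=1, keepdims=True)
    return audio + weights @ visual

# Toy example: 4 audio tokens and 6 visual tokens in an 8-dimensional space.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
visual = rng.normal(size=(6, 8))

cost = cosine_cost(audio, visual)
plan = sinkhorn(cost)
loss = ot_loss(plan, cost)
fused = ot_fuse(audio, visual, plan)
```

In a full pipeline under these assumptions, `loss` would be added to the captioning loss during training and `fused` would be projected into the LLM's input space; both steps are omitted here.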

Paper Structure (20 sections, 9 equations, 2 figures, 7 tables)

Figures (2)

  • Figure 1: (a) Overview of the proposed LAVCap Framework. (b) Detail of the Optimal Transport Fusion module.
  • Figure 2: Qualitative results of captions generated from models trained solely on audio, only on visual, and on both audio and visual.