AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

Jongsuk Kim, Jiwon Shin, Junmo Kim

TL;DR

AVCap approaches automated captioning by treating audio-visual features as text tokens inside a self-attention decoder, which keeps the model scalable and easy to extend across modalities. Modality-specific encoders feed a joint fusion stage; the fused representation is then projected into the text-embedding space and decoded under an attention mask that combines audio-visual and textual information. Ablations examine the choice of encoder architecture, the adaptation of pre-trained encoders, and multimodal fusion, and the model reports state-of-the-art or competitive results on the AudioCaps dataset. With its code released as open source, the work offers a practical, extensible baseline for multi-modal captioning.
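
The "features as text tokens" idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not define the mask, so the sketch assumes a prefix-style layout in which audio-visual tokens attend to each other bidirectionally and text tokens attend to every audio-visual token plus earlier text. The function names, the token counts, and the linear projection are hypothetical.

```python
import numpy as np


def project_to_text_space(av_features, weight, bias):
    """Linearly map fused audio-visual features into the text-embedding width.

    A plain linear layer is an assumption made for this sketch.
    """
    return av_features @ weight + bias


def build_prefix_mask(num_av, num_text):
    """Boolean (L, L) mask with L = num_av + num_text.

    mask[i, j] is True when query token i may attend to key token j.
    Assumed layout (not taken from the source): AV tokens first, then text.
    """
    total = num_av + num_text
    mask = np.zeros((total, total), dtype=bool)
    # AV tokens see all AV tokens (bidirectional) and no text tokens.
    mask[:num_av, :num_av] = True
    # Text tokens see every AV token ...
    mask[num_av:, :num_av] = True
    # ... and only the current and earlier text tokens (causal).
    mask[num_av:, num_av:] = np.tril(np.ones((num_text, num_text), dtype=bool))
    return mask


rng = np.random.default_rng(0)
num_av, num_text, av_dim, text_dim = 4, 3, 8, 6  # toy sizes, chosen arbitrarily

av_features = rng.normal(size=(num_av, av_dim))        # stand-in for fused encoder output
weight = rng.normal(size=(av_dim, text_dim))
bias = np.zeros(text_dim)
text_embeddings = rng.normal(size=(num_text, text_dim))  # stand-in for caption token embeddings

# Projected AV features are prepended to the caption embeddings as extra "tokens".
decoder_input = np.concatenate(
    [project_to_text_space(av_features, weight, bias), text_embeddings], axis=0
)
mask = build_prefix_mask(num_av, num_text)
```

In a real decoder, `mask` would be passed to each self-attention layer so that disallowed positions are excluded before the softmax; the exact mask used by AVCap is the one shown in the paper's Figure 3.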

Abstract

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework that serves as a simple yet powerful baseline for audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which brings advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics, and the code is available at https://github.com/JongSuk1/AVCap

Paper Structure (22 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Qualitative results comparing captions generated from audio-only, video-only, and audio-video training.
  • Figure 2: Overview of AVCap.
  • Figure 3: Attention mask $M$.