RECAP: Retrieval-Augmented Audio Captioning

Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha

TL;DR

RECAP targets audio captioning under domain shift: it retrieves captions similar to the input audio, builds a prompt from them, and conditions a GPT-2 decoder on CLAP audio embeddings through cross-attention. CLAP serves as the audio encoder and GPT-2 as the decoder; only the cross-attention layers are trained, and a replaceable datastore of captions supplies external knowledge. On Clotho and AudioCaps, the method achieves competitive in-domain results and significant out-of-domain gains, particularly with a large text-only caption datastore, and it can caption novel audio events and compositional audio without additional fine-tuning. The authors also release 150,000+ weakly labeled captions for AudioSet, AudioCaps, and Clotho to support future research.

Abstract

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a training-free fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
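
The retrieval and prompt-construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny hand-written vectors stand in for CLAP audio and text embeddings, the function names `retrieve_top_k` and `build_prompt` are hypothetical, and the prompt template is an assumed wording chosen for the example.

```python
import numpy as np

def retrieve_top_k(audio_emb, caption_embs, captions, k=2):
    """Return the k captions whose embeddings are most cosine-similar to the audio."""
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = c @ a                      # cosine similarity of each caption to the audio
    top = np.argsort(-sims)[:k]       # indices of the k highest similarities
    return [captions[i] for i in top]

def build_prompt(retrieved):
    """Join retrieved captions into a text prompt (template is an assumption)."""
    similar = " ".join(f"{cap}." for cap in retrieved)
    return f"Audios similar to this audio sound like: {similar} This audio sounds like:"

# Toy datastore: captions with placeholder embeddings (a real system would
# embed both the captions and the audio with an audio-text model such as CLAP).
captions = ["a dog barks", "rain falls on a roof", "a car engine idles"]
caption_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_emb = np.array([0.9, 0.4, 0.1])  # placeholder audio embedding

retrieved = retrieve_top_k(audio_emb, caption_embs, captions, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

Because the datastore is only a list of captions and their embeddings, swapping in captions from a new domain changes what is retrieved without retraining anything, which is the property the abstract refers to as a replaceable datastore.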

Paper Structure

This paper contains 6 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: We propose RECAP, a retrieval-augmented audio captioning model. RECAP can caption novel concepts never before seen in training and improves the captioning of audio with multiple events.
  • Figure 2: Illustration of RECAP. RECAP fine-tunes a GPT-2 LM conditioned on audio representations from the last hidden state of CLAP (Wu et al., 2023) and a text prompt. The text prompt is constructed using captions most similar to the audio, retrieved from a datastore using CLAP.
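
The conditioning shown in Figure 2 relies on cross-attention, where decoder token states act as queries and audio representations supply keys and values. The sketch below is a generic single-head version in NumPy under stated assumptions: all dimensions and the random weights are arbitrary placeholders, residual connections and layer normalization are omitted, and it is not the paper's actual layer configuration.

```python
import numpy as np

def cross_attention(text_h, audio_h, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens attend over audio frames."""
    q = text_h @ w_q                      # (T, d_k) queries from decoder states
    k = audio_h @ w_k                     # (S, d_k) keys from audio representations
    v = audio_h @ w_v                     # (S, d)   values from audio representations
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over audio frames
    return weights @ v, weights

rng = np.random.default_rng(0)
text_h = rng.normal(size=(5, 8))   # 5 prompt/caption tokens, hidden size 8 (placeholder)
audio_h = rng.normal(size=(3, 6))  # 3 audio frames, hidden size 6 (placeholder)
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(6, 4))
w_v = rng.normal(size=(6, 8))

out, weights = cross_attention(text_h, audio_h, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each output row is a weighted mix of audio values, so every text position receives an audio-dependent update of the same width as the decoder hidden state.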