Zero-Shot Audio Captioning Using Soft and Hard Prompts

Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

TL;DR

This work tackles data scarcity and cross-domain generalization in audio captioning by proposing a zero-shot approach that trains solely on textual data. It builds on CLAP, which places audio and text in a shared semantic space, and introduces two prompting strategies to bridge the modality gap at inference: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. Empirical results on AudioCaps and Clotho show that the method outperforms other zero-shot approaches in both in-domain and cross-domain settings and approaches some fully supervised baselines; a multilingual extension also demonstrates strong multilingual capabilities. The approach offers a practical path to scalable, cross-domain, and multilingual audio captioning without costly audio-text paired data, with implications for real-world accessibility and retrieval applications.

Abstract

In traditional audio captioning methods, a model is usually trained in a fully supervised manner on a human-annotated dataset of audio-text pairs and then evaluated on the test set of the same dataset. Such methods have two limitations. First, they are often data-hungry and require time-consuming and expensive human annotation to obtain audio-text pairs. Second, they often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, a problem that has received little attention. To address these issues, we propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, helping it generate captions more robustly and accurately. Extensive experiments on the AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches in in-domain scenarios and outperforms the compared methods in cross-domain scenarios, underscoring the generalization ability of our method.
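The retrieval-based hard prompt mentioned in the abstract can be pictured with a minimal sketch. The abstract does not say what is retrieved or how the prompt is worded, so everything concrete here is an assumption made for illustration: the small bank of sound-event labels, the toy 3-dimensional embeddings (standing in for CLAP embeddings), the cosine-similarity ranking, the value of `k`, and the `"Sound events: ..."` template. The function name `build_hard_prompt` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_hard_prompt(query_emb, label_bank, k=2):
    """Return a textual prompt listing the k labels closest to query_emb.

    label_bank is a list of (label, embedding) pairs. In a real system the
    embeddings would come from a shared audio-text space; here they are toy
    values (assumption).
    """
    ranked = sorted(
        label_bank,
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    top_labels = [label for label, _ in ranked[:k]]
    # Hypothetical prompt template, not taken from the paper.
    return "Sound events: " + ", ".join(top_labels) + "."


# Toy label bank with made-up embeddings.
label_bank = [
    ("dog barking", [0.9, 0.1, 0.0]),
    ("rain", [0.0, 0.9, 0.1]),
    ("car engine", [0.1, 0.0, 0.9]),
]

# The query stands in for a text embedding (training) or an audio
# embedding (inference); both live in the same space in this sketch.
query = [0.8, 0.3, 0.0]
prompt = build_hard_prompt(query, label_bank, k=2)
print(prompt)
```

Because the query and the label bank share one embedding space, the same retrieval routine can be called with a text embedding during training and an audio embedding during inference, which is the property the abstract relies on.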

Paper Structure (30 sections, 4 equations, 2 figures, 14 tables)

Figures (2)

  • Figure 1: (a) The structure of the CLAP model. Through contrastive learning, CLAP maps audio and text into the same semantic space. Grey triangles and pentagons represent audio and text embeddings, respectively. (b) The structure of the base zero-shot audio captioning model, where a language decoder is trained for text reconstruction using only text data, conditioned on the CLAP text encoder. During inference, the CLAP audio encoder is combined with the language decoder to generate captions.
  • Figure 2: The overall architecture of our proposed method. Specifically, in the training stage, we reconstruct the input text based on acoustic-aware prompts and soft prompts with only textual data, so training does not require any paired data. During inference, we replace the CLAP text encoder $f_{clap}^{Text}(\cdot)$ with the CLAP audio encoder $f_{clap}^{Audio}(\cdot)$ to generate the descriptive text of the input audio.
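The encoder swap described in the Figure 2 caption can be sketched as a single function that picks the decoder's conditioning vector. This is an illustration under stated assumptions, not the paper's implementation: the embeddings are plain Python lists standing in for the outputs of $f_{clap}^{Text}(\cdot)$ and $f_{clap}^{Audio}(\cdot)$, and the Gaussian noise added to the text embedding during training is one generic way to make a decoder tolerant of the text-audio gap. Whether it corresponds to the paper's mixed augmentation is not stated in this excerpt. The names `decoder_input` and `noise_std` are hypothetical.

```python
import random


def l2_normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else list(v)


def add_gaussian_noise(embedding, std, rng):
    """Perturb each coordinate with zero-mean Gaussian noise."""
    return [x + rng.gauss(0.0, std) for x in embedding]


def decoder_input(text_emb=None, audio_emb=None, noise_std=0.0, rng=None):
    """Choose the embedding that conditions the language decoder.

    Training:  pass text_emb (optionally perturbed, an assumption here).
    Inference: pass audio_emb; no noise is applied.
    """
    if (text_emb is None) == (audio_emb is None):
        raise ValueError("provide exactly one of text_emb or audio_emb")
    if text_emb is not None:
        if rng is None:
            rng = random.Random(0)
        return l2_normalize(add_gaussian_noise(text_emb, noise_std, rng))
    return l2_normalize(audio_emb)


# Training-time conditioning: text embedding only, no paired audio needed.
train_vec = decoder_input(text_emb=[3.0, 4.0], noise_std=0.1, rng=random.Random(1))

# Inference-time conditioning: the audio embedding replaces the text one.
infer_vec = decoder_input(audio_emb=[0.0, 2.0])
```

The decoder itself is left out on purpose: the point of the sketch is only that the same downstream model receives a text-derived vector in training and an audio-derived vector at inference.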