Towards Generating Diverse Audio Captions via Adversarial Training

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

TL;DR

This work tackles the lack of diversity in automated audio captioning by introducing a conditional GAN (C-GAN) framework that combines a caption generator with two hybrid discriminators (one for naturalness, one for semantic fidelity) and a CIDEr-based language evaluator. The generator is pretrained with maximum likelihood estimation (MLE) and augmented with a noise vector to produce diverse captions. It is then trained via reinforcement learning with self-critical sequence training (SCST) to maximize a reward that blends naturalness, semantic relevance, and conventional evaluation scores. In experiments on Clotho v2.0, the approach yields greater corpus-level and set-level diversity while maintaining competitive fidelity, and ablations clarify the roles of each component and of pretraining. The method also shows improved naturalness as judged by GPT-4 and extends prior ICASSP work by integrating a semantic evaluator into the adversarial training loop, offering a practical path toward more human-like, varied audio descriptions.
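The blended reward and the SCST update can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal weights, the function names, and the example scores are all assumptions. The baseline follows the standard SCST formulation, in which the reward of a greedily decoded caption is subtracted from the reward of a sampled caption.

```python
def blended_reward(naturalness, semantic, cider, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three scores. The weights are hypothetical."""
    w_nat, w_sem, w_cider = weights
    return w_nat * naturalness + w_sem * semantic + w_cider * cider


def scst_advantage(sampled_scores, greedy_scores, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Reward of the sampled caption minus reward of the greedy baseline."""
    sampled = blended_reward(*sampled_scores, weights=weights)
    greedy = blended_reward(*greedy_scores, weights=weights)
    return sampled - greedy


def scst_loss(token_log_probs, advantage):
    """REINFORCE-style loss for one sampled caption.

    Minimizing this raises the log-probability of captions that beat the
    greedy baseline and lowers it for captions that do worse.
    """
    return -advantage * sum(token_log_probs)


# Illustrative scores: (naturalness, semantic fidelity, CIDEr).
advantage = scst_advantage((0.8, 0.6, 0.4), (0.5, 0.5, 0.5))
loss = scst_loss([-1.0, -2.0], advantage)
```

With these made-up scores the sampled caption beats the greedy one, so the advantage is positive and the gradient of the loss pushes the generator toward the sampled caption.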

Abstract

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip in diverse ways, from various aspects and with distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are trained jointly, where the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity than state-of-the-art methods.

Paper Structure

This paper contains 24 sections, 7 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed adversarial training framework, where the caption generator aims at generating captions to confuse the two hybrid discriminators, while the naturalness discriminator aims at correctly classifying human-annotated and machine-generated captions, and the semantic discriminator aims at discriminating whether the generated captions are faithful to the content of the given audio clips. The language evaluator evaluates captions based on conventional evaluation metrics.
  • Figure 2: Diagram of the caption generator, which consists of a 10-layer CNN as the audio encoder and a 2-layer Transformer as the language decoder. To encourage diversity in the generated captions, a random noise vector is concatenated with the audio features extracted by the audio encoder before being fed into the language decoder.
  • Figure 3: Diagram of the hybrid discriminators. (a) The naturalness discriminator receives a caption as input and outputs a probability indicating how natural the caption is. (b) The semantic discriminator receives an audio clip and a caption as inputs, and outputs a probability indicating whether the caption is faithful to the content of the input audio clip or not.
  • Figure 4: Comparison of $n$-gram ($n$ up to 3) count ratios on the test set for different models. An $n$-gram count ratio is the ratio of the frequency of an $n$-gram in the generated captions to its expected frequency in the test set. A count ratio around 1.0 means that the vocabulary statistics of the test set match well with those of the training set.
  • Figure 5: Change in vocabulary size as the word-count threshold varies.
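
The statistics behind Figures 4 and 5 can be approximated with a short script. This is a sketch under stated assumptions, not the authors' evaluation code: whitespace tokenization, lowercasing, and the use of relative frequency in the reference captions as the "expected frequency" are all assumptions, as are the function names and the toy captions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(captions, n):
    """Count n-grams over a list of caption strings."""
    counts = Counter()
    for caption in captions:
        counts.update(ngrams(caption.lower().split(), n))
    return counts


def count_ratios(generated, references, n):
    """Relative frequency of each reference n-gram in the generated
    captions, divided by its relative frequency in the references."""
    gen = ngram_counts(generated, n)
    ref = ngram_counts(references, n)
    gen_total = sum(gen.values())
    ref_total = sum(ref.values())
    if gen_total == 0 or ref_total == 0:
        return {}
    return {
        gram: (gen[gram] / gen_total) / (ref_count / ref_total)
        for gram, ref_count in ref.items()
    }


def vocabulary_size(captions, min_count=1):
    """Number of distinct words that occur at least min_count times."""
    counts = ngram_counts(captions, 1)
    return sum(1 for c in counts.values() if c >= min_count)


# Toy captions, for illustration only.
generated = ["a dog barks", "a dog barks"]
references = ["a dog barks", "a cat meows"]
ratios = count_ratios(generated, references, 1)
```

In this toy case the generated captions overuse "dog" (ratio above 1) and never produce "cat" (ratio 0), which is the kind of skew a count-ratio plot makes visible. Raising `min_count` in `vocabulary_size` shows how much of a model's vocabulary consists of words it uses repeatedly rather than once.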