A Whisper transformer for audio captioning trained with synthetic captions and transfer learning

Marek Kadlčík, Adam Hájek, Jürgen Kieslich, Radosław Winiecki

TL;DR

This work investigates audio captioning with a Whisper encoder-decoder that is pretrained on speech-to-text (STT) tasks and further pretrained on synthetic captions derived from AudioSet, complemented by human-annotated data (AudioCaps, Clotho). The authors study data mixtures, model sizes, finetuning strategies, and decoding methods. They find that STT pretraining and pretraining on a mix of synthetic and human-written captions improve SPIDEr scores, especially for larger Whisper variants. They also show that larger models benefit more from audio captioning (AC) pretraining, that overfitting can be mitigated by selective layer freezing, and that beam search produces better captions than the other decoding strategies tested. The work offers practical guidance on using synthetic captions and transfer learning in audio captioning, and releases public code and pretrained models to support future research and deployment.

Abstract

The field of audio captioning has advanced significantly in recent years, driven by the availability of large-scale audio datasets and progress in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions. We discuss our training procedures and present the results of our experiments, which cover variations in model size, dataset mixtures, and other hyperparameters. Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.

Paper Structure (29 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Creating a synthetic caption from an AudioSet label
  • Figure 2: Examples of output sequences. The red parts are tokens used by the Whisper architecture, and the blue part denotes our own prefix specifying the source dataset and captioning task.
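
The output-sequence layout described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the special-token strings are Whisper's standard ones, but their exact order here, the `dataset > task: ` prefix layout, the helper names `build_prefix`, `build_target`, and `extract_caption`, and the example caption are all assumptions made for illustration. The figure in the paper is the authoritative reference.

```python
# Whisper's standard special tokens; the exact sequence used by the
# authors is an assumption here.
WHISPER_START = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
WHISPER_END = "<|endoftext|>"


def build_prefix(dataset: str, task: str = "caption") -> str:
    """Hypothetical prefix layout naming the source dataset and the task."""
    return f"{dataset} > {task}: "


def build_target(dataset: str, caption: str, task: str = "caption") -> str:
    """Concatenate Whisper tokens, the dataset/task prefix, and the caption."""
    return WHISPER_START + build_prefix(dataset, task) + caption + WHISPER_END


def extract_caption(sequence: str, dataset: str, task: str = "caption") -> str:
    """Remove the Whisper tokens and the prefix, leaving only the caption."""
    body = sequence
    if body.startswith(WHISPER_START):
        body = body[len(WHISPER_START):]
    if body.endswith(WHISPER_END):
        body = body[: -len(WHISPER_END)]
    prefix = build_prefix(dataset, task)
    if body.startswith(prefix):
        body = body[len(prefix):]
    return body.strip()


if __name__ == "__main__":
    # "a dog barks in the distance" is a made-up caption, not dataset content.
    target = build_target("clotho", "a dog barks in the distance")
    print(target)
    print(extract_caption(target, "clotho"))
```

A prefix of this kind lets a single model be trained on several caption sources at once, with the desired caption style selected at inference time by forcing the matching prefix before decoding.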