Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu

TL;DR

The paper tackles exposure bias in encoder-decoder audio captioning by introducing a temporally aware cross-modal similarity measure. It proposes the unbiased sliced Wasserstein RBF (USW-RBF) kernel, augmented with rotary positional embedding, and shows that the kernel admits an unbiased Monte Carlo estimate whose approximation error decays at the parametric rate $\mathcal{O}(L^{-1/2})$ in the number of Monte Carlo samples $L$. The ACUS framework embeds this kernel into training and uses stochastic decoding to mitigate caption degeneration during inference. Across AudioCaps and Clotho, with multiple backbones, ACUS improves caption length, lexical diversity, and text-to-audio self-retrieval accuracy, while ablations justify the temporal encoding and kernel choices.

Abstract

Teacher-forcing training for audio captioning usually leads to exposure bias due to training and inference mismatch. Prior works propose the contrastive method to deal with caption degeneration. However, the contrastive method ignores the temporal information when measuring similarity across acoustic and linguistic modalities, leading to inferior performance. In this work, we develop the temporal-similarity score by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel equipped with rotary positional embedding to account for temporal information across modalities. In contrast to the conventional sliced Wasserstein RBF kernel, we can form an unbiased estimation of USW-RBF kernel via Monte Carlo estimation. Therefore, it is well-suited to stochastic gradient optimization algorithms, and its approximation error decreases at a parametric rate of $\mathcal{O}(L^{-1/2})$ with $L$ Monte Carlo samples. Additionally, we introduce an audio captioning framework based on the unbiased sliced Wasserstein kernel, incorporating stochastic decoding methods to mitigate caption degeneration during the generation process. We conduct extensive quantitative and qualitative experiments on two datasets, AudioCaps and Clotho, to illustrate the capability of generating high-quality audio captions. Experimental results show that our framework is able to increase caption length, lexical diversity, and text-to-audio self-retrieval accuracy.
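
The Monte Carlo estimation described in the abstract can be illustrated with a minimal sketch. It assumes the kernel has the form $\mathbb{E}_{\theta}[\exp(-\gamma W_p^p(\theta\sharp\mu, \theta\sharp\nu))]$ with $\theta$ uniform on the unit sphere; this summary does not reproduce the paper's definition, so that form, the function name `usw_rbf_estimate`, and all parameter names are assumptions made for illustration. The rotary positional embedding step is omitted, and both inputs are assumed to contain the same number of equally weighted points.

```python
import numpy as np


def usw_rbf_estimate(x, y, gamma=1.0, p=2, num_projections=64, rng=None):
    """Monte Carlo estimate of E_theta[exp(-gamma * W_p^p(theta#mu, theta#nu))].

    x, y: arrays of shape (n, d) holding equally weighted samples of mu and nu.
    Assumption: both sets have the same number of points, so the 1D optimal
    transport plan simply pairs the sorted projections.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("this sketch assumes x and y are 2D with identical shapes")

    d = x.shape[1]
    # Uniform directions on the unit sphere: normalized Gaussian vectors.
    theta = rng.standard_normal((num_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project onto every direction and sort along the sample axis: (n, L).
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)

    # One-dimensional W_p^p for each direction: shape (L,).
    w_pp = np.mean(np.abs(x_proj - y_proj) ** p, axis=0)

    # Average the exponentials. Each term is an independent draw of the
    # integrand, so the sample mean is unbiased for the expectation.
    return float(np.mean(np.exp(-gamma * w_pp)))


# Hypothetical usage with random stand-ins for audio and caption features.
demo_rng = np.random.default_rng(0)
audio_feats = demo_rng.standard_normal((20, 8))
text_feats = demo_rng.standard_normal((20, 8)) + 0.5
similarity = usw_rbf_estimate(audio_feats, text_feats, gamma=0.5,
                              num_projections=128, rng=1)
```

Under the same assumption, a plug-in estimate of the conventional sliced Wasserstein RBF kernel would instead compute `np.exp(-gamma * np.mean(w_pp))`. Because the exponential is applied after averaging, that estimate is biased for any finite number of projections (by Jensen's inequality), which is the contrast the abstract draws.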

Paper Structure

This paper contains 22 sections, 3 theorems, 30 equations, 2 figures, 9 tables.

Key Result

Proposition 3.2

The USW-RBF kernel with $p=2$ is a positive definite kernel for all $\gamma > 0$ and absolutely continuous probability distributions $\mu$ and $\nu$.
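
One plausible route to this result is sketched below; it is offered as intuition and is not necessarily the authors' proof. It assumes the kernel has the form $\mathbb{E}_{\theta}[\exp(-\gamma W_2^2(\theta\sharp\mu, \theta\sharp\nu))]$, which this summary does not state explicitly. In one dimension, the 2-Wasserstein distance equals the $L^2$ distance between quantile functions, so each slice is a Gaussian kernel on a Hilbert space:

$$
W_2^2(\theta\sharp\mu, \theta\sharp\nu) = \int_0^1 \left| F^{-1}_{\theta\sharp\mu}(t) - F^{-1}_{\theta\sharp\nu}(t) \right|^2 \, dt = \left\| q^{\theta}_{\mu} - q^{\theta}_{\nu} \right\|_{L^2([0,1])}^2 .
$$

For any distributions $\mu_1, \dots, \mu_n$ and real coefficients $c_1, \dots, c_n$, linearity of the expectation then gives

$$
\sum_{i,j=1}^{n} c_i c_j \, \mathbb{E}_{\theta}\!\left[ e^{-\gamma W_2^2(\theta\sharp\mu_i, \theta\sharp\mu_j)} \right] = \mathbb{E}_{\theta}\!\left[ \sum_{i,j=1}^{n} c_i c_j \, e^{-\gamma \| q^{\theta}_{\mu_i} - q^{\theta}_{\mu_j} \|_{L^2}^2} \right] \ge 0 ,
$$

since the inner sum is nonnegative for every fixed $\theta$ (the Gaussian kernel on a Hilbert space is positive definite for all $\gamma > 0$) and an average of nonnegative quantities is nonnegative.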

Figures (2)

  • Figure 1: An overview of the training and inference stages of the ACUS framework. $Z_x$ and $Z_y$ are two sequential latent representations of audio and caption, respectively.
  • Figure 2: Ablation studies for sampling hyperparameters of stochastic sampling methods of the Enclap backbone on the AudioCaps dataset. The SPIDEr metric is chosen for tuning the sampling hyperparameters since it combines the SPICE and CIDEr evaluation metrics.

Theorems & Definitions (4)

  • Definition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Proposition 3.4