Discrete Audio Representations for Automated Audio Captioning
Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu
TL;DR
This work systematically evaluates discrete audio representations for automated audio captioning (AAC), contrasting acoustic tokens produced by codecs with semantic tokens derived from BEATs-based representations. It shows that discretization generally degrades AAC performance because of information loss, but that semantic token approaches, especially a supervised audio tokenizer trained with an audio tagging objective, can recover semantic content and approach the performance of continuous representations on Clotho. The study reports strong results for BEATs-derived semantic tokens across multiple tokenization strategies and identifies the conditions under which domain data and token vocabularies affect performance. Overall, the paper contributes a supervised tokenization pipeline that mitigates information loss and achieves results competitive with continuous inputs.
Abstract
Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to automated audio captioning (AAC) remains underexplored. This paper systematically investigates the viability of audio token-driven models for AAC through comparative analyses of various tokenization methods. Our findings reveal that audio tokenization leads to performance degradation in AAC models compared to those that directly utilize continuous audio representations. To address this issue, we introduce a supervised audio tokenizer trained with an audio tagging objective. Unlike unsupervised tokenizers, which lack explicit semantic understanding, the proposed tokenizer effectively captures audio event information. Experiments conducted on the Clotho dataset demonstrate that the proposed audio tokens outperform conventional audio tokens in the AAC task.
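To make the idea of unsupervised tokenization concrete, the following is a minimal sketch that assumes k-means clustering as the tokenizer, which is one common choice and not necessarily the method used in this paper. The function names `fit_codebook` and `quantize`, the random stand-in features, and the vocabulary size of 8 are all hypothetical and chosen for illustration only. The final reconstruction error illustrates the kind of information loss that discretization introduces.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame to the index of its nearest centroid (its token)."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_codebook(features, vocab_size, n_iters=20, seed=0):
    """Fit a k-means codebook on frame-level features of shape (T, D)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen, distinct frames.
    idx = rng.choice(len(features), size=vocab_size, replace=False)
    codebook = features[idx].copy()
    for _ in range(n_iters):
        tokens = quantize(features, codebook)
        for k in range(vocab_size):
            members = features[tokens == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook

# Assumption: random vectors stand in for continuous encoder outputs
# (200 frames, 16 dimensions). These are not real audio features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

codebook = fit_codebook(frames, vocab_size=8)
tokens = quantize(frames, codebook)

# Replacing each frame by its centroid discards within-cluster detail,
# so the reconstruction error is a rough proxy for information loss.
mse = float(((frames - codebook[tokens]) ** 2).mean())
print(tokens[:10], round(mse, 3))
```

A captioning model driven by audio tokens would consume the integer sequence `tokens` (typically through an embedding table) instead of the continuous `frames`. A supervised tokenizer, as proposed in the abstract, would instead shape the codebook with a training objective such as audio tagging, which this sketch does not attempt.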
