Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jizhong Liu; Gang Li; Junbo Zhang; Heinrich Dinkel; Yongqing Wang; Zhiyong Yan; Yujun Wang; Bin Wang

TL;DR

The paper tackles automated audio captioning (AAC) by addressing three bottlenecks: encoder token efficiency, decoder capability for long, multi-event audio, and limited data quality. It introduces LOAE, an end-to-end system that pairs a consistent ensemble distillation (CED) audio encoder, adapted with LoRA, with a Q-Former that bridges to a frozen Llama 2 7B decoder, also adapted with LoRA. A separate post-corrector LLM then fixes text errors caused by insufficient training data and annotation ambiguities. The model is trained with a cross-entropy loss. Key contributions include (i) reducing the acoustic token count via $17:1$ compression and using $64$-dimensional Mel-filterbank features, (ii) building the decoder on a large language model (Llama 2, 7B) fine-tuned with LoRA, and (iii) using a post-corrector to improve linguistic quality. The approach achieves a SPIDEr-FL score of $33.0$, outperforming the DCASE 2023 Task 6A winner, and the experiments indicate that LoRA and post-correction each help make LLM-based AAC practical.
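The token compression mentioned above can be illustrated with a single cross-attention step, which is the core mechanism a Q-Former uses to map a long sequence onto a fixed number of learned queries. This is a minimal, dependency-free sketch, not the paper's implementation: it omits the projection matrices, self-attention, and feed-forward layers of a real Q-Former, and the sizes (`NUM_TOKENS = 34`, `NUM_QUERIES = 2`, `DIM = 8`) are hypothetical values chosen only so that the ratio comes out to 17:1.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Each query attends over all tokens and returns one vector per query.

    The output length equals the number of queries, regardless of how
    many tokens come in, which is what produces the compression.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        scores = [scale * sum(q * t for q, t in zip(query, token))
                  for token in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * token[d] for w, token in zip(weights, tokens))
                        for d in range(dim)])
    return outputs


# Hypothetical sizes, chosen only to reproduce a 17:1 ratio.
DIM = 8
NUM_TOKENS = 34
NUM_QUERIES = 2

rng = random.Random(0)
audio_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
# In a trained model these queries are learned parameters; here they are random.
learned_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

compressed = cross_attend(learned_queries, audio_tokens)
ratio = NUM_TOKENS // NUM_QUERIES
print(f"{NUM_TOKENS} audio tokens -> {len(compressed)} query tokens ({ratio}:1)")
```

The decoder then only has to process `NUM_QUERIES` acoustic tokens instead of `NUM_TOKENS`, which matters because the cost of attention in the LLM grows with sequence length.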

Abstract

Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) a pre-trained audio encoder obtained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a SPIDEr-FL score of 33.0, outperforming the winner of DCASE 2023 Task 6A.
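To make the LoRA step concrete, the following sketch shows the standard low-rank adaptation formulation, $y = Wx + \frac{\alpha}{r} BAx$, where $W$ stays frozen and only the small matrices $A$ and $B$ are trained. It is an illustration under stated assumptions rather than the paper's code: the class name `LoRALinear`, the dimensions, the rank, and the value of `alpha` are all hypothetical, and the abstract does not state which layers receive adapters.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, frozen_weight, rank, alpha, seed=0):
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        rng = random.Random(seed)
        self.weight = frozen_weight  # frozen, shape d_out x d_in
        # A starts small and random, B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Count only the adapter parameters (A and B)."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# Hypothetical toy sizes for illustration.
D_OUT, D_IN, RANK = 4, 6, 2
rng = random.Random(1)
frozen = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]

layer = LoRALinear(frozen, rank=RANK, alpha=4.0)
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
y_init = layer.forward(x)

full_params = D_OUT * D_IN
lora_params = layer.num_trainable()
print(f"trainable: {lora_params} adapter parameters vs {full_params} in the full matrix")
```

At these toy sizes the saving is small, but it grows with layer width: for a hypothetical 4096 x 4096 weight and rank 8, the adapter has 8 x (4096 + 4096) = 65,536 parameters against 16,777,216 in the full matrix.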

Paper Structure (13 sections, 1 figure, 6 tables)

This paper contains 13 sections, 1 figure, 6 tables.

Introduction
Method
Experimental Setup
Datasets
Metrics
Implementation Details
Results
Overall Performance Comparison
Audio Encoding Comparison
Text Decoding Comparison
Impact of LoRA
Post-Corrector Analysis
Conclusion

Figures (1)

Figure 1: The architecture of the proposed LOAE method integrates an optimized encoding-decoding framework with a LoRA fine-tuning strategy. The LLM component remains frozen, while a Q-Former bridges the encoder and the decoder. Additionally, an extra LLM serves as the post-corrector.
