AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech

Jielin Qiu; Jianguo Zhang; Zixiang Chen; Liangwei Yang; Ming Zhu; Juntao Tan; Haolin Chen; Wenting Zhao; Rithesh Murthy; Roshan Ram; Akshara Prabhakar; Shelby Heinecke; Caiming Xiong; Silvio Savarese; Huan Wang

TL;DR

The results reveal that Gemini models generally outperform OpenAI models on overall captioning quality, with Gemini 3 Pro achieving the highest overall score (6.00/10), while OpenAI models exhibit lower hallucination rates.

Abstract

We introduce AudioCapBench, a benchmark for evaluating the audio captioning capabilities of large multimodal models. AudioCapBench covers three audio domains (environmental sound, music, and speech) with 1,000 curated evaluation samples drawn from established datasets. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework that scores predictions on three orthogonal dimensions: accuracy (semantic correctness), completeness (coverage of reference content), and hallucination (absence of fabricated content). Our results reveal that Gemini models generally outperform OpenAI models on overall captioning quality, with Gemini 3 Pro achieving the highest overall score (6.00/10), while OpenAI models exhibit lower hallucination rates. All models perform best on speech captioning and worst on music captioning. We release the benchmark and the evaluation code to facilitate reproducible audio understanding research.
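To make the reference-based side of the evaluation concrete, the following is a minimal sketch of a ROUGE-L style score between a predicted caption and a reference caption. It assumes lowercase whitespace tokenization and a plain F1 combination of LCS precision and recall; the function names `lcs_length` and `rouge_l_f1` are hypothetical, and the benchmark's released code may tokenize and weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence that ended just before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L as an F1 score (assumption: equal weight on precision and recall)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("a dog barks loudly", "a dog barks"))
```

A score like this rewards word overlap in order, so a correct paraphrase can still score low, which is one reason a benchmark might pair such metrics with an LLM judge.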

Paper Structure (35 sections, 2 equations, 4 figures, 3 tables)

Introduction
AudioCapBench
Dataset Construction
Evaluation Metrics
LLM-as-Judge (Primary).
Reference-Based Metrics (Secondary).
Models Evaluated
Results
Overall Leaderboard
Per-Category Analysis
Findings
Accuracy–Hallucination Trade-off.
Realtime Models Underperform.
Gemini's Completeness Advantage.
Reference Metrics vs. LLM Judge.
...and 20 more sections

Figures (4)

Figure 1: Distribution of audio duration (top) and reference caption length (bottom) for each category. Dashed lines indicate means. Music clips are longest; speech clips are shortest but have detailed emotional captions.
Figure 2: Overall LLM judge scores for all 13 models, colored by provider. Gemini models (blue) dominate the top positions, followed by OpenAI Chat Completions (green) and Realtime (orange).
Figure 3: Per-category LLM Overall scores. Speech is consistently the easiest category across all models, while music is the hardest. Performance gaps between models are largest on sound captioning.
Figure 4: Accuracy vs. hallucination trade-off. Models in the upper-right are both accurate and grounded. OpenAI mini models cluster in the upper-left (conservative but empty). Gemini models tend toward higher accuracy but lower hallucination scores.
