VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao, Hanzhao Li, Liqiang Zhang, Meng Yu, Dong Yu
TL;DR
VCB Bench tackles the lack of real-speech Chinese benchmarks for large audio language models (LALMs) by constructing a large-scale dataset from authentic recordings and evaluating models along three dimensions: Instruction Following, Knowledge, and Robustness. It introduces the task sets TIF, SIF, and MTD for instruction following, GK/ML/DC/SC for knowledge, and SV/EV/CV for robustness, with bilingual support, and builds them with a rigorous data-quality pipeline and manual screening. The experiments identify Qwen3-Omni as the current state of the art and GPT-4o-Audio as a strong SIF performer, while revealing persistent challenges in cross-lingual alignment and robustness to real-world perturbations. The benchmark offers a reproducible framework and practical guidance for advancing Chinese voice conversational LALMs.
Abstract
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited: they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench), a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
