OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim, Jeonghoon Kim, Taeuk Kim, Hwiyeol Jo

Abstract

Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.

Paper Structure (65 sections, 21 figures, 7 tables)

Figures (21)

  • Figure 1: Comparison of prior omni-modal benchmarks and OmniACBench. Existing ones assess multimodal understanding via text outputs, whereas ours targets speech generation given text, vision, and speech inputs.
  • Figure 2: Construction pipeline of OmniACBench with representative examples for each acoustic feature. (1) Acoustic Feature Selection defines the target acoustic features and associated image keywords. (2) Tri-Modal Generation constructs each instance from a neutral text script, a spoken control signal, and a generated image. (3) Quality Control Protocol applies filtering and quantitative verification to ensure data quality and diversity.
  • Figure 3: Results of Controlled Input Decomposition across all evaluation metrics. Starting from the Original setting, inputs are progressively textualized through S-to-T, I-to-T, and All-to-T, while Oracle explicitly specifies the target acoustic value. Dashed red lines in Emo-Acc, GA-Acc, and Tim-Acc indicate random baselines.
  • Figure 4: Linear probing of context-relevant information across model layers. MiniCPM-o 4.5 preserves decodable context into the TTS decoder, whereas Qwen3-Omni 30B drops to near chance in the Talker.
  • Figure 5: A prompt used for text transcript generation.
  • ...and 16 more figures