OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning
Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim
TL;DR
OMHBench addresses shortcomings of prior omni-modal benchmarks by requiring every question to be jointly grounded in all three modalities (text, vision, and speech) and by balancing multi-hop reasoning paths across them. It introduces a four-stage, rule-based pipeline that generates 6,144 questions spanning finance, economics, climate, and nutrition, with explicit path control and quality guarantees, including a novel Path Balance Score (PBS). Experiments show a substantial performance gap between proprietary and open-source models, strong sensitivity to reasoning path order, and notable asymmetry in cross-modal grounding, especially where speech is involved. The work provides a principled framework for robust omni-modal evaluation and highlights practical gaps in current omni-modal intelligence, guiding future data generation and model training efforts.
Abstract
Multimodal Large Language Models (MLLMs) increasingly support omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models are highly sensitive to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.
