OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning

Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim

TL;DR

OMHBench addresses the shortcomings of prior omni-modal benchmarks by enforcing joint tri-modal grounding and balanced multi-hop reasoning paths across text, vision, and speech. It introduces a four-stage, rule-based pipeline that generates 6,144 questions over finance, economics, climate, and nutrition, with explicit path control and quality guarantees, including a novel Path Balance Score (PBS). Experimental results show a substantial performance gap between proprietary and open-source models, strong sensitivity to reasoning path order, and notable asymmetry in cross-modal grounding, especially where speech is involved. The work provides a principled framework for robust omni-modal evaluation and highlights practical gaps that limit current omni-modal intelligence, guiding future data generation and model training efforts.

Abstract

Multimodal Large Language Models (MLLMs) have increasingly supported omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.

Paper Structure

This paper contains 57 sections, 1 equation, 24 figures, 7 tables.

Figures (24)

  • Figure 1: Omni-Modal Understanding (OMU) benchmarks lack textual context and suffer from modality shortcuts, while Cross-Modal Multi-Hop Reasoning (CMR) datasets exclude speech and exhibit imbalanced reasoning paths. OMHBench addresses these issues.
  • Figure 2: Illustration of the proposed task formulation with two example questions and their reasoning paths: I-T (left) and I-T-S (right). Attributes (e.g., Goodwill) are accessible only through specific modalities, while entities (e.g., Company) are shared across modalities.
  • Figure 3: Fraction of instances in the OMU benchmarks that remain solvable without visual or auditory input. Across all datasets, nearly 70-80% of instances are susceptible, indicating that shortcuts allowing models to answer without using certain modalities are prevalent.
  • Figure 4: Performance comparison on the original MuMuQA and MMQA datasets vs. their controlled variants. When revised to ensure balanced reasoning paths, accuracy drops by up to 18%, raising concerns about the validity of previous evaluations on the original datasets.
  • Figure 5: Overview of the OMHBench pipeline. (1) Table Triplet Formation constructs table triplets that share the same entities yet have separate attributes. (2) Multi-Hop QA Construction yields multi-hop QA pairs by utilizing content from table triplets. (3) Omni-Modal Context Generation converts each table into text, image, and speech modalities. (4) Reasoning Diversification guarantees multiple reasoning paths via modality permutation.
  • ...and 19 more figures
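
The modality permutation named in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `modality_assignments` and the table labels are hypothetical, and the sketch only assumes that each table in a triplet is rendered in exactly one of the three modalities. Under that assumption, a triplet admits 3! = 6 distinct table-to-modality assignments.

```python
from itertools import permutations

# The three modalities named in the paper; the string labels are illustrative.
MODALITIES = ("text", "image", "speech")


def modality_assignments(tables):
    """Return every one-to-one mapping of three tables onto the modalities.

    `tables` is a sequence of three hypothetical table identifiers.
    """
    if len(tables) != len(MODALITIES):
        raise ValueError("expected exactly three tables")
    # Each permutation of the modalities pairs table i with a different modality.
    return [dict(zip(tables, perm)) for perm in permutations(MODALITIES)]


# Hypothetical table identifiers for one triplet.
assignments = modality_assignments(["table_a", "table_b", "table_c"])
for mapping in assignments:
    print(mapping)
```

Enumerating all assignments, rather than sampling one, is what would let every table appear equally often in every modality, which matches the balance goal the caption describes.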