ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang

TL;DR

ROVER addresses the lack of reciprocal cross-modal reasoning evaluation in unified multimodal models by introducing a two-setting, human-annotated benchmark that probes how language-guided reasoning can drive visual generation and how generated visuals can support verbal reasoning. It comprises 1,312 tasks grounded in 1,876 images, organized into two task families (ROVER-IG and ROVER-TG), and a multi-dimensional evaluation protocol that uses a VLM judge (GPT-4.1) plus expert validation. Across 17 models, it reveals that cross-modal reasoning strongly influences visual output quality, that interleaved generation outperforms single-turn baselines, and that models struggle with symbolic abstractions, indicating a dissociation between physical and symbolic reasoning. These findings highlight reciprocal cross-modal reasoning as a critical frontier for truly omnimodal generation and offer guidance for future model design and evaluation.

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning: textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address the pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning; it contains 1,312 tasks grounded in 1,876 images and spans two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show a dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.

Paper Structure

This paper contains 19 sections, 18 figures, 4 tables.

Figures (18)

  • Figure 1: The ROVER benchmark. ROVER evaluates UMMs through reciprocal cross-modal reasoning: ROVER-IG (left) requires generating images with language-augmented reasoning, while ROVER-TG (right) requires generating text answers with visually-augmented reasoning.
  • Figure 2: Overview of ROVER-IG, the benchmark for evaluating how unified multimodal models generate images under intensive verbal reasoning. The benchmark spans $4$ domains (natural science, culture and art, common sense, and logic), instantiated across $7$ reasoning subtasks.
  • Figure 3: Overview of ROVER-TG, the benchmark for evaluating visually-augmented reasoning in verbal generation. The benchmark spans $3$ scenarios (physical world modeling, logical assistance, and visual perception enhancement), instantiated across $6$ subtasks.
  • Figure 4: Example outputs on ROVER-TG. Each row corresponds to one reasoning scenario, with the input on the left and outputs from representative unified models shown across columns.
  • Figure 5: Cascade reasoning evaluation across EditWorld and ROVER benchmarks. We compare cascade approaches (FLUX+GPT with GPT-4o prompt refinement) against UMMs.
  • ...and 13 more figures