ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?

Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, Jing Liu, Xiongkuo Min, Guangtao Zhai

TL;DR

ODI-Bench introduces the first comprehensive ODI-oriented benchmark with 2,000 real-world omnidirectional images and 4,254 QA pairs across 10 tasks to evaluate MLLMs on both general- and spatial-level ODI understanding under dual evaluation modes. The study benchmarks 20 leading MLLMs, revealing that current models still struggle with immersive ODI contexts, especially spatial reasoning. To address this, the authors propose Omni-CoT, a training-free chain-of-thought framework comprising viewpoint-guided answering, crop cue grounding and refinement, and response refinement, which significantly boosts performance across tasks. The benchmark and accompanying Omni-CoT framework pave the way for more accurate and context-aware ODI understanding, with code and data to be released upon publication. This work advances embodied and panoramic vision understanding by highlighting current limitations and offering a practical method to improve MLLMs in immersive environments.

Abstract

Omnidirectional images (ODIs) provide a full 360°×180° field of view and are widely adopted in VR, AR, and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method that significantly enhances MLLMs' comprehension of omnidirectional environments through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon publication.

Paper Structure

This paper contains 53 sections, 1 equation, 15 figures, 7 tables.

Figures (15)

  • Figure 1: We introduce ODI-Bench, a comprehensive benchmark for omnidirectional image understanding, covering 10 diverse tasks with both close-ended and open-ended evaluation. To further improve model performance, we propose Omni-CoT, a chain-of-thought framework that enhances MLLMs’ comprehension on omnidirectional images via step-by-step reasoning.
  • Figure 2: Data distribution in ODI-Bench.
  • Figure 3: Construction procedures of ODI-Bench. (a) The benchmark images are carefully selected to ensure quality and diversity. (b) The majority of tasks are manually annotated by human experts. (c) Instance-level QA pairs are generated through a dedicated annotation pipeline with human verification to guarantee quality.
  • Figure 4: We introduce Omni-CoT, a framework that enhances VLMs’ comprehension of omnidirectional images via chain-of-thought reasoning in three steps: viewpoint-guided answering, grounding and refinement of crop cues, and response refinement. Compared with direct answering, Omni-CoT achieves notable performance improvements.
  • Figure 5: Illustration of omnidirectional image browsing. The ODI is viewed using a VR head-mounted display, with the corresponding viewpoints on the ERP projection shown in the right panel.
  • ...and 10 more figures
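
The Figure 5 caption refers to viewpoints located on the ERP (equirectangular) projection. As a rough illustration of that relationship, the sketch below maps a viewing direction to a pixel in an equirectangular image, where longitude and latitude map linearly to the horizontal and vertical axes. The function name `direction_to_erp_pixel` and the angle convention (yaw 0 at the image centre, pitch +90 at the top row) are assumptions made for this example and are not taken from the paper.

```python
def direction_to_erp_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to (x, y) pixel coordinates in an ERP image.

    Assumed convention (hypothetical, not from the paper):
      yaw   in [-180, 180], 0 at the horizontal centre of the image
      pitch in [-90, 90],  +90 (straight up) at the top row
    """
    # Longitude maps linearly onto the full image width (360 degrees).
    u = (yaw_deg + 180.0) / 360.0 * width
    # Latitude maps linearly onto the full image height (180 degrees).
    v = (90.0 - pitch_deg) / 180.0 * height

    # The image wraps around horizontally, so yaw = 180 lands on column 0.
    x = int(u) % width
    # There is no vertical wrap-around, so clamp to the valid row range.
    y = min(max(int(v), 0), height - 1)
    return x, y


if __name__ == "__main__":
    # Looking straight ahead in a 2048x1024 ERP image.
    print(direction_to_erp_pixel(0, 0, 2048, 1024))
```

The horizontal wrap-around is what distinguishes an ODI from a conventional 2D image: the left and right image borders are adjacent in the scene, which is one reason spatial questions about such images can be harder than they look on the flat projection.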