360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs' ability to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images and seven representative (sub)tasks, with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.

Paper Structure (18 sections, 1 equation, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) Example of an image from 360Bench in two common projection formats with relevant entities highlighted. (b) Responses of GPT-4o and Free360 (Ours) with correct answers highlighted in red. GPT-4o struggles with spatial reasoning, whereas Free360 correctly infers spatial relations. 360° image source: Insta360.
  • Figure 2: Statistics of 360Bench: (a) Task breakdown, (b) Image distribution, (c) Spatial distribution of annotated bounding boxes.
  • Figure 3: Illustration of tasks included in 360Bench. Correct answers and relevant entities are highlighted in red. 360° image sources: NOIRLabFlickHape.
  • Figure 4: Overview of Free360. The model performs question answering via scene graph generation through four steps (Sec. \ref{subsec:SGG}): (1) Entity Identification, detecting entities relevant to the question $Q$ from the CMP image $I$; (2) Attribute Extraction, deriving descriptive attributes from each entity crop; (3) Inter-Entity Relation Detection, capturing spatial relations among entities; and (4) Entity-View Relation Detection, modeling spatial relations between entities and camera views. The resulting scene graph, represented in textual form, is then fed into the MLLM to generate the final answer and reasoning analysis (Sec. \ref{subsec:answer}). 360° image source: Insta360.
  • Figure 5: Example of a scene graph in the serialized textual form.
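
The captions of Figures 4 and 5 describe a scene graph that collects entities, their attributes, inter-entity relations, and entity-view relations, and is then serialized to text for the MLLM. The sketch below illustrates one way such a structure could be held and serialized. It is not the authors' implementation: the class names (`Entity`, `SceneGraph`), the field names, the example entities, and the output layout are all assumptions made for illustration, and the actual serialized format in the paper may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A question-relevant entity with its attributes (hypothetical structure)."""
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Illustrative container for the four kinds of information named in Figure 4."""
    entities: dict[str, Entity] = field(default_factory=dict)
    # (subject, relation, object) triples between two entities
    entity_relations: list[tuple[str, str, str]] = field(default_factory=list)
    # (entity, relation, view) triples between an entity and a camera view
    view_relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, name: str, attributes=()) -> None:
        self.entities[name] = Entity(name, list(attributes))

    def serialize(self) -> str:
        """Flatten the graph into plain text (assumed layout, not the paper's)."""
        lines = ["Entities:"]
        for entity in self.entities.values():
            attrs = ", ".join(entity.attributes) if entity.attributes else "none"
            lines.append(f"- {entity.name} (attributes: {attrs})")
        lines.append("Inter-entity relations:")
        for subj, rel, obj in self.entity_relations:
            lines.append(f"- {subj} {rel} {obj}")
        lines.append("Entity-view relations:")
        for ent, rel, view in self.view_relations:
            lines.append(f"- {ent} {rel} {view}")
        return "\n".join(lines)


# Made-up example data, only to show the flow from graph to prompt text.
graph = SceneGraph()
graph.add_entity("red car", ["parked", "sedan"])
graph.add_entity("traffic light")
graph.entity_relations.append(("red car", "is to the left of", "traffic light"))
graph.view_relations.append(("red car", "is visible in", "front view"))

prompt_context = graph.serialize()
print(prompt_context)
```

In a pipeline of this shape, the serialized text would be placed in the prompt alongside the question, so the answering step reasons over explicit relations rather than the distorted panoramic pixels alone.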