Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method

Xinshen Zhang, Zhen Ye, Xu Zheng

TL;DR

This work addresses the gap in omnidirectional visual understanding by introducing OmniVQA, the first open ODI-VQA dataset and benchmark built on ERP panoramas, and detailing three question types that target polar-region distortions. It proposes a 360-R1 post-training framework using rule-based reinforcement learning with Group Relative Policy Optimization, employing three specialized rewards to improve reasoning, answer accuracy, and output formatting. Empirical results show that 360-R1 yields consistent improvements over multiple state-of-the-art MLLMs on the OmniVQA benchmark, driven by the structured reasoning guidance and a reasoning- and format-aware training regime. The contributions enable more reliable panoramic reasoning for applications in AR, embodied AI, and immersive systems. The authors acknowledge limitations such as dataset scale and indoor focus, and outline future work on outdoor scenes, efficiency, and multimodal expansion.

Abstract

Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications like augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset for omnidirectional visual question answering, and by conducting the first benchmark on this task. Our evaluation of state-of-the-art MLLMs reveals significant limitations on this task, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, which calls for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct. Concretely, we modify group relative policy optimization (GRPO) by proposing three novel reward functions: (1) a reasoning process similarity reward, (2) an answer semantic accuracy reward, and (3) a structured format compliance reward. Extensive experiments on OmniVQA demonstrate the superiority of our proposed method in omnidirectional space (+6% improvement).
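
As a rough illustration of how three rule-based rewards could feed a GRPO-style update, the sketch below combines per-response reward scores and normalizes them within a sampled group. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the equal weights, and the example scores are all hypothetical, and only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def combine_rewards(reasoning, answer, fmt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-response rewards.

    The equal weights are a hypothetical choice for illustration only.
    """
    w_reasoning, w_answer, w_format = weights
    return w_reasoning * reasoning + w_answer * answer + w_format * fmt


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one panoramic question, each scored as
# (reasoning similarity, answer accuracy, format compliance).
group = [
    combine_rewards(0.8, 1.0, 1.0),
    combine_rewards(0.5, 0.0, 1.0),
    combine_rewards(0.6, 1.0, 0.0),
    combine_rewards(0.2, 0.0, 0.0),
]
advantages = group_relative_advantages(group)
```

Responses scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged; no separate value network is needed for this normalization.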

Paper Structure

This paper contains 23 sections, 4 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Comparison of occlusion reasoning in 360° VQA. Given a panoramic scene and a spatial question about whether an object in the upper pole region is partially occluded, four models provide different levels of reasoning. 360-R1 demonstrates the most precise and comprehensive reasoning, identifying relevant spatial elements and producing a correct answer. QwenVL2.5-7B gives the correct answer but its explanation is partially flawed and lacks depth. In contrast, both JanusPro-7B and InternVL2.5-8B fail to answer correctly, primarily due to limited or inaccurate analysis of the upper pole region.
  • Figure 2: Error Cases in Omnidirectional Captioning. Three common errors by multimodal LLMs on 360° images: (a) misidentified objects; (b) incorrect object attributes or context; (c) hallucinated content unrelated to the image.
  • Figure 3: Overview of the OmniVQA Dataset Construction and 360-R1 Framework.
  • Figure 4: Iterative Refinement Pipeline.
  • Figure 5: Benchmark Construction Pipeline.
  • ...and 12 more figures