Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method
Xinshen Zhang, Zhen Ye, Xu Zheng
TL;DR
This work addresses the gap in omnidirectional visual understanding by introducing OmniVQA, the first open visual question answering dataset and benchmark for omnidirectional images (ODI-VQA), built on equirectangular projection (ERP) panoramas and organized around three question types that target polar-region distortions. It proposes 360-R1, a post-training framework that applies rule-based reinforcement learning with Group Relative Policy Optimization (GRPO) and uses three specialized rewards to improve reasoning, answer accuracy, and output formatting. Empirical results show that 360-R1 yields consistent improvements over multiple state-of-the-art MLLMs on the OmniVQA benchmark, driven by structured reasoning guidance and a reasoning- and format-aware training regime. The contributions enable more reliable panoramic reasoning for applications in AR, embodied AI, and immersive systems. The authors acknowledge limitations such as dataset scale and an indoor focus, and outline future work on outdoor scenes, efficiency, and multimodal expansion.
Abstract
Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications such as augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset for omnidirectional visual question answering, and by conducting the first benchmark on this task. Our evaluation of state-of-the-art MLLMs reveals significant limitations in omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, which calls for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce 360-R1, a rule-based reinforcement learning method built on Qwen2.5-VL-Instruct. Concretely, we adapt Group Relative Policy Optimization (GRPO) with three novel reward functions: (1) a reasoning process similarity reward, (2) an answer semantic accuracy reward, and (3) a structured format compliance reward. Extensive experiments on OmniVQA show that the proposed method outperforms the evaluated baselines on omnidirectional scenes (+6% improvement).
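To make the reward design concrete, the following is a minimal sketch of how three rule-based rewards could be summed and then turned into group-relative advantages, as in standard GRPO. It is not the authors' implementation: the `<think>`/`<answer>` template, the equal weights, the helper names, and the word-overlap similarity (a crude stand-in for whatever semantic measure the paper uses) are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template; the actual format used by 360-R1 is not given here.
FORMAT_RE = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (assumed stand-in for semantic similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def total_reward(output, ref_reasoning, ref_answer, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of reasoning-similarity, answer-accuracy, and format rewards."""
    match = FORMAT_RE.match(output.strip())
    if match is None:
        # Malformed output: no format reward, and nothing reliable to score.
        return 0.0
    reasoning, answer = match.group(1), match.group(2)
    w_reason, w_answer, w_format = weights  # assumed equal weights
    return (
        w_reason * token_overlap(reasoning, ref_reasoning)
        + w_answer * token_overlap(answer, ref_answer)
        + w_format * 1.0
    )


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, one would sample several responses per question, score each with `total_reward`, and pass the resulting list to `group_relative_advantages`; responses scoring above the group mean receive positive advantages and the rest negative ones, so no separate value model is needed.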
