PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

Zekai Lin, Xu Zheng

TL;DR

The paper proposes a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) to enhance 3D reasoning on Equirectangular Projection (ERP) images. Its ground-truth-guided reward incorporates five geometry-aware strategies, such as distance tolerance and spatial consistency.

Abstract

360° panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding: they reach only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59 percentage points) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
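
To make the reward idea in the abstract concrete, the following is a minimal sketch of a distance-tolerance reward combined with GRPO-style group-relative advantages. It is not the paper's implementation: the function names, the 10% relative tolerance, the linear decay outside the tolerance band, and the use of the population standard deviation are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def distance_reward(pred, gt, rel_tol=0.10):
    """Hypothetical distance-tolerance reward (not the paper's exact rule).

    Returns 1.0 when the prediction is within rel_tol of the ground truth,
    otherwise decays linearly with the relative error and is floored at 0.
    """
    if gt <= 0:
        raise ValueError("ground-truth distance must be positive")
    rel_err = abs(pred - gt) / gt
    if rel_err <= rel_tol:
        return 1.0
    return max(0.0, 1.0 - rel_err)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO.

    Each advantage is (r - group mean) / (group std + eps). The population
    standard deviation is an assumption; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question with ground-truth distance 2.0 and four sampled answers.
predictions = [2.05, 3.0, 5.0, 1.9]
rewards = [distance_reward(p, 2.0) for p in predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch, answers inside the tolerance band receive positive advantages and answers far from the ground truth receive negative ones. When every sample in a group earns the same reward, all advantages are zero and the group contributes no learning signal.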

Paper Structure

This paper contains 28 sections, 8 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 2: Overview of the PanoEnv-QA construction pipeline. We convert multi-view TartanAir data into ERP panoramas and generate geometry-grounded QA pairs using depth, semantics, and 3D projections.
  • Figure 3: Overview of our framework, including GRPO sampling, routed reward computation, and two-stage curriculum updates.
  • Figure 4: Training dynamics of our two-stage GRPO curriculum. Stage 1 quickly learns output format and structured decision-making; Stage 2 inherits this and focuses on improving OE reasoning under balanced training.
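
The geometry behind the ERP panoramas and 3D projections mentioned in the Figure 2 caption can be illustrated with a small sketch that maps an ERP pixel to a unit viewing ray and, given a depth value, to a 3D point. The axis convention (y up, z forward at the image center), the continuous pixel coordinates, and the function names are assumptions for illustration; the actual conventions of the PanoEnv-QA pipeline may differ.

```python
import math


def erp_pixel_to_ray(u, v, width, height):
    """Map continuous ERP pixel coordinates (u, v) to a unit direction.

    Assumed convention: u spans longitude [-pi, pi) left to right,
    v spans latitude [pi/2, -pi/2] top to bottom, y is up, and the
    image center looks along +z.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def backproject(u, v, width, height, depth):
    """Scale the viewing ray by a range-style depth to get a 3D point."""
    x, y, z = erp_pixel_to_ray(u, v, width, height)
    return (depth * x, depth * y, depth * z)


# The center pixel of a 1024x512 panorama, 3 units away, lies on the +z axis.
point = backproject(512, 256, 1024, 512, 3.0)
```

Because every image row covers the full 360° of longitude, rows near the poles stretch a small solid angle across the whole image width. This is one source of the geometric distortion that the abstract identifies as a difficulty for VLMs.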