3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Ting Huang, Zeyu Zhang, Hao Tang

TL;DR

3D-R1 addresses the gap in robust reasoning for 3D vision-language models by introducing a cold-start chain-of-thought dataset (Scene-30K) generated via a CoT data engine, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) incorporating format, perception, and semantic similarity rewards. A dynamic view selection mechanism further guides the model to select the most informative viewpoints for 3D scene understanding. The approach yields substantial improvements across seven diverse 3D benchmarks, demonstrating enhanced reasoning, grounding, and cross-modal alignment. This work combines synthetic CoT supervision, reward-driven policy optimization, and adaptive perception strategies to advance generalized, reasoning-rich 3D scene understanding in open-source, foundation-model-style VLMs.

Abstract

Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization because high-quality spatial data is limited and viewpoint assumptions are static. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with chain-of-thought (CoT) annotations, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, during reinforcement learning we apply an RLHF-style policy optimization method, GRPO, to enhance reasoning capabilities, and we introduce three reward functions (a perception reward, a semantic similarity reward, and a format reward) to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
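
The three rewards and the group-relative step of GRPO can be illustrated with a minimal sketch. It is not the authors' implementation. The `<think>`/`<answer>` template, the axis-aligned box layout, the use of plain embedding vectors for semantic similarity, the unweighted sum in `total_reward`, and all function names are assumptions made for illustration. The KL term against the reference model is omitted.

```python
import math
import re
from statistics import mean, pstdev

# Assumed output template; the tag names are illustrative, not taken from the paper.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(text):
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0


def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(x):
        return (x[3] - x[0]) * (x[4] - x[1]) * (x[5] - x[2])

    return inter / (volume(a) + volume(b) - inter)


def cosine(u, v):
    """Cosine similarity between two embedding vectors (e.g. from a text encoder)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def total_reward(text, pred_box, gt_box, pred_emb, gt_emb):
    """Assumed combination: an unweighted sum of the three rewards."""
    return format_reward(text) + box_iou_3d(pred_box, gt_box) + cosine(pred_emb, gt_emb)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because each output is scored relative to the other outputs sampled for the same question, no separate value network is needed to estimate a baseline. Outputs above the group mean receive positive advantages and those below receive negative ones.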

Paper Structure

This paper contains 41 sections, 8 equations, 17 figures, 11 tables, 1 algorithm.

Figures (17)

  • Figure 1: 3D-R1 is an open-source generalist model that enhances the reasoning of 3D VLMs for unified scene understanding.
  • Figure 2: (a) Architecture. It takes text, multi-view images, 3D point clouds, and depth maps as input and formulates comprehensive 3D tasks as autoregressive sequence prediction. (b) Distribution of question types. Scene-30K contains diverse categories. (c) Multi-task performance. 3D-R1 demonstrates strong performance across various tasks. (d) Generalizability. 3D-R1 exhibits remarkable generalizability with enhanced reasoning capabilities.
  • Figure 3: CoT data engine. The point cloud of a scene is first sent to a scene description generator to obtain a description of the scene. Based on this description, Gemini 2.5 Pro is then used to synthesize CoT data.
  • Figure 4: The GRPO-based reinforcement learning pipeline. The policy model generates $N$ outputs from a point cloud and a question. Perception (IoU), semantic (CLIP-similarity), and format-adherence rewards are then computed, grouped, and combined with a KL term against a frozen reference model to update the policy.
  • Figure 5: Performance surfaces under different dynamic view selection weight configurations. We analyze the influence of text relevance ($w_t$), spatial coverage ($w_c$), and CLIP-based similarity ($w_{\text{clip}}$) on model performance, with the constraint $w_c + w_{\text{clip}} = 1$. Results on 3D-QA (ScanQA) and 3D-VG (ScanRefer) reveal that optimal performance emerges when $w_t$ is within the range of 0.3 to 0.4, combined with balanced visual weights.
  • ...and 12 more figures
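
The weighting scheme in the Figure 5 caption can be sketched as a per-view score. This is only an illustration. The caption gives the three weights and the constraint $w_c + w_{\text{clip}} = 1$ but not the exact scoring formula, so the linear combination below, the top-k selection, the dictionary keys, and the name `select_views` are all assumptions. The defaults follow the caption's reported range ($w_t$ between 0.3 and 0.4, balanced visual weights).

```python
def select_views(views, w_t=0.35, w_c=0.5, k=2):
    """Rank candidate views and return the ids of the top k.

    views: list of dicts with keys 'id', 'text_rel', 'coverage', 'clip_sim',
           where each score is assumed to lie in [0, 1].
    """
    if not 0.0 <= w_c <= 1.0:
        raise ValueError("w_c must lie in [0, 1]")
    w_clip = 1.0 - w_c  # constraint from the caption: w_c + w_clip = 1

    scored = []
    for v in views:
        # Assumed linear combination of the three cues.
        score = w_t * v["text_rel"] + w_c * v["coverage"] + w_clip * v["clip_sim"]
        scored.append((score, v["id"]))

    scored.sort(key=lambda s: s[0], reverse=True)
    return [view_id for _, view_id in scored[:k]]


candidates = [
    {"id": "A", "text_rel": 0.9, "coverage": 0.2, "clip_sim": 0.4},
    {"id": "B", "text_rel": 0.1, "coverage": 0.8, "clip_sim": 0.6},
    {"id": "C", "text_rel": 0.5, "coverage": 0.5, "clip_sim": 0.5},
]
top_views = select_views(candidates, k=2)
```

With these hypothetical scores, view A has the highest text relevance yet ranks last, because at $w_t = 0.35$ the two visual cues together outweigh the text cue.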