PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Dongxu Zhang, Yiding Sun, Pengcheng Li, Yumou Liu, Hongqiang Lin, Haoran Xu, Xiaoxuan Mu, Liang Lin, Wenbiao Yan, Ning Yang, Chaowei Fang, Juanjuan Zhao, Jihua Zhu, Conghui He, Cheng Tan

TL;DR

This work presents PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data by leveraging a dual-stream multi-modal architecture that synergizes semantic appearance with geometric truth.

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models, but they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations: they confidently generate plausible responses that are not grounded in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a "Look, Think, then Answer" paradigm, in which the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising $\sim$86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Existing 3D-LLMs (left) treat geometric reasoning as a direct mapping process and often suffer from geometric hallucination: here, the model fails to perceive the missing leg and incorrectly judges the chair as stable. In contrast, our proposed PointCoT (right) introduces an explicit Look-Think-Answer paradigm. By generating a geometry-grounded rationale (detecting the missing rear-left leg) before the final conclusion, our method significantly reduces hallucinations and enables interpretable 3D reasoning.
  • Figure 2: The Data Construction Pipeline of Point-Reason-Instruct. The pipeline consists of three stages: (1) Dual-Stream Preprocessing, where objects are sampled into point clouds and rendered into 8 spherical views; (2) Multi-Task Reasoning Generation, where the Qwen2.5-VL teacher agent generates hierarchical CoT rationales covering geometric attributes, spatial relations, and functionality; and (3) Quality Filtering, which validates the rationales against metadata to eliminate hallucinations.
  • Figure 3: The overall architecture of PointCoT. The framework operates in a Look-Think-Answer paradigm. In the Look Stage, a dual-stream encoder extracts geometric and visual features, which are fused into a Tri-Modal Manifold $\mathbf{z}$ via Geometry-Guided Cross-Modal Attention. During the Think Stage, a VLM autoregressively generates an explicit rationale $\mathcal{R}$, while its hidden states $h_t$ are strictly grounded to $H_{geo}$ via an InfoNCE loss $\mathcal{L}_{anchor}$ to mitigate spatial hallucinations. Finally, in the Answer Stage, the final answer $\mathcal{A}$ is deduced conditionally on both $\mathbf{z}$ and $\mathcal{R}$. The entire pipeline is trained through a progressive dual-stage optimization using $\mathcal{L}_{gen}$ and $\mathcal{L}_{pred}$.
  • Figure 4: Qualitative comparison on the Point-Reason-Instruct benchmark. While the baseline relies on implicit semantic priors and suffers from geometric hallucinations, PointCoT follows a Look-Think-Answer paradigm. By actively selecting decisive viewpoints and explicitly grounding rationales in local 3D geometric structures, PointCoT effectively mitigates hallucinations and yields interpretable predictions.
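
The Figure 3 caption states that the hidden states $h_t$ are grounded to $H_{geo}$ via an InfoNCE loss $\mathcal{L}_{anchor}$. The text here does not give the exact formulation, so the following is only a minimal sketch of a generic InfoNCE objective in plain Python. The cosine similarity, the temperature value, the function names, and especially the rule that assigns each hidden state one "positive" geometric token (`positive_idx`) are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(hidden_states, geo_tokens, positive_idx, temperature=0.07):
    """Mean InfoNCE loss over a sequence of hidden states.

    hidden_states: list of vectors standing in for h_t (hypothetical layout).
    geo_tokens:    list of vectors standing in for the rows of H_geo.
    positive_idx:  for each h_t, the index of its matching geometric token
                   (ASSUMPTION: the pairing rule is not given in the text).
    """
    total = 0.0
    for h, pos in zip(hidden_states, positive_idx):
        logits = [cosine(h, g) / temperature for g in geo_tokens]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive token.
        total += log_denom - logits[pos]
    return total / len(hidden_states)


# Toy usage: one hidden state that points toward the first geometric token.
geo_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce([[0.9, 0.1]], geo_tokens, [0])
misaligned = info_nce([[0.9, 0.1]], geo_tokens, [2])
```

In this toy case the loss is lower when the hidden state is paired with the geometric token it already resembles (`aligned`) than when it is paired with the opposite one (`misaligned`), which is the behaviour an anchoring term of this kind is meant to reward.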