Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Abstract

Pretraining 3D encoders by aligning them with Contrastive Language-Image Pretraining (CLIP) has emerged as a promising direction for learning generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment objectives that enforce geometric and semantic consistency across views. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/
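The abstract specifies the CLIP alignment only at a high level, but scene-to-text alignment of this kind typically follows the standard symmetric InfoNCE recipe. Below is a minimal PyTorch sketch of such a CLIP-style contrastive objective; the function name, the batching convention (one caption per scene), and the temperature value are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(scene_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning scene embeddings with CLIP text
    embeddings. Assumes one matching caption per scene in the batch,
    so positives lie on the diagonal of the similarity matrix."""
    scene_feats = F.normalize(scene_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = scene_feats @ text_feats.t() / temperature   # (B, B)
    targets = torch.arange(scene_feats.size(0), device=scene_feats.device)
    loss_s2t = F.cross_entropy(logits, targets)           # scene -> text
    loss_t2s = F.cross_entropy(logits.t(), targets)       # text -> scene
    return 0.5 * (loss_s2t + loss_t2s)
```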

Figures (4)

  • Figure 1: Overview of UniScene3D. Top: UniScene3D takes multi-view images and pointmaps as input to learn 3D representations for viewpoint grounding, scene retrieval, zero-/few-shot scene type classification, and 3D visual question answering. Bottom: Example of viewpoint grounding. Image appearance cues enable correct color recognition (left), while pointmap geometry supports reasoning about spatial extent, enabling identification of the longest seat (right). The radar chart shows comparisons between UniScene3D and prior state-of-the-art methods across multiple tasks and benchmarks.
  • Figure 2: Overview of UniScene3D pretraining. UniScene3D takes multi-view image–pointmap pairs as input and performs early fusion at the patch embedding stage. The fused tokens, augmented with absolute positional encodings, are then processed by $N$ Transformer blocks to produce a unified colored pointmap representation. During pretraining, UniScene3D is optimized with four alignment objectives: (1) cross-view geometric alignment $\mathcal{L}_{\text{geo}}$; (2) grounded view alignment $\mathcal{L}_{\text{ground}}$; (3) view-level alignment $\mathcal{L}_{\text{view}}$; and (4) scene-level alignment $\mathcal{L}_{\text{scene}}$. The blocks corresponding to these objectives are highlighted in cyan. (A minimal sketch of the early-fusion step follows this list.)
  • Figure 3: Qualitative viewpoint grounding results. Correct and incorrect matches are highlighted in green and red, respectively, based on the referring texts. Underlined phrases denote contextual clues that guide the models in solving the grounding task.
  • Figure 4: Qualitative viewpoint grounding results. Correct and incorrect matches are highlighted in green and red, respectively, based on the referring texts. Underlined phrases denote contextual clues that guide the models in solving the grounding task.
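Figure 2 describes early fusion of appearance and geometry at the patch-embedding stage. The sketch below shows one plausible reading of that step in PyTorch: RGB and per-pixel XYZ channels are concatenated into a six-channel map and projected jointly into patch tokens, which would then receive absolute positional encodings before entering the $N$ Transformer blocks. The class name, patch size, and embedding dimension are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ColoredPointmapPatchEmbed(nn.Module):
    """Early fusion of an RGB image and its XYZ pointmap: the six
    channels are concatenated and projected jointly, so every patch
    token carries both appearance and geometry."""

    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        # A single strided convolution acts as the shared patch projection.
        self.proj = nn.Conv2d(6, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, rgb, pointmap):
        # rgb, pointmap: (B*V, 3, H, W) for V views per scene
        x = torch.cat([rgb, pointmap], dim=1)   # (B*V, 6, H, W)
        x = self.proj(x)                        # (B*V, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B*V, num_patches, D)

# Usage: four views of a scene at 224x224 resolution.
embed = ColoredPointmapPatchEmbed()
rgb = torch.randn(4, 3, 224, 224)    # RGB images
pts = torch.randn(4, 3, 224, 224)    # per-pixel XYZ pointmaps
tokens = embed(rgb, pts)             # (4, 196, 768) fused patch tokens
```

Fusing before the first Transformer block, rather than running separate image and geometry encoders, lets every attention layer operate on tokens that already combine both modalities, which is consistent with the "unified colored pointmap representation" the caption describes.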