Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Abstract

Pretraining 3D encoders by aligning them with Contrastive Language-Image Pretraining (CLIP) has emerged as a promising direction for learning generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment objectives that enforce geometric and semantic consistency across views. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/
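The abstract specifies the CLIP alignment only at a high level, but scene-to-text alignment of this kind typically follows the standard symmetric InfoNCE recipe. Below is a minimal PyTorch sketch of such a CLIP-style contrastive objective; the function name, the batching convention (one caption per scene), and the temperature value are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(scene_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning scene embeddings with CLIP text
    embeddings. Assumes one matching caption per scene in the batch,
    so positives lie on the diagonal of the similarity matrix."""
    scene_feats = F.normalize(scene_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = scene_feats @ text_feats.t() / temperature   # (B, B)
    targets = torch.arange(scene_feats.size(0), device=scene_feats.device)
    loss_s2t = F.cross_entropy(logits, targets)           # scene -> text
    loss_t2s = F.cross_entropy(logits.t(), targets)       # text -> scene
    return 0.5 * (loss_s2t + loss_t2s)
```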

Figures (4)

  • Figure 1: Overview of UniScene3D. Top: UniScene3D takes multi-view images and pointmaps as input to learn 3D representations for viewpoint grounding, scene retrieval, zero-/few-shot scene type classification, and 3D visual question answering. Bottom: Example of viewpoint grounding. Image appearance cues enable correct color recognition (left), while pointmap geometry supports reasoning about spatial extent, enabling identification of the longest seat (right). The radar chart shows comparisons between UniScene3D and prior state-of-the-art methods across multiple tasks and benchmarks.
  • Figure 2: Overview of UniScene3D pretraining. UniScene3D takes multi-view image–pointmap pairs as input and performs early fusion at the patch embedding stage. The fused tokens, augmented with absolute positional encodings, are then processed by $N$ Transformer blocks to produce a unified colored pointmap representation. During pretraining, UniScene3D is optimized with four alignment objectives: (1) cross-view geometric alignment $\mathcal{L}_{\text{geo}}$; (2) grounded view alignment $\mathcal{L}_{\text{ground}}$; (3) view-level alignment $\mathcal{L}_{\text{view}}$; and (4) scene-level alignment $\mathcal{L}_{\text{scene}}$. The blocks corresponding to these objectives are highlighted in cyan. (A minimal sketch of the early-fusion step follows this list.)
  • Figure 3: Qualitative viewpoint grounding results. Correct and incorrect matches are highlighted in green and red, respectively, based on the referring texts. Underlined phrases denote contextual clues that guide the models in solving the grounding task.
  • Figure 4: Qualitative viewpoint grounding results. Correct and incorrect matches are highlighted in green and red, respectively, based on the referring texts. Underlined phrases denote contextual clues that guide the models in solving the grounding task.
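Figure 2 describes early fusion of appearance and geometry at the patch-embedding stage. The sketch below shows one plausible reading of that step in PyTorch: RGB and per-pixel XYZ channels are concatenated into a six-channel map and projected jointly into patch tokens, which would then receive absolute positional encodings before entering the $N$ Transformer blocks. The class name, patch size, and embedding dimension are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ColoredPointmapPatchEmbed(nn.Module):
    """Early fusion of an RGB image and its XYZ pointmap: the six
    channels are concatenated and projected jointly, so every patch
    token carries both appearance and geometry."""

    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        # A single strided convolution acts as the shared patch projection.
        self.proj = nn.Conv2d(6, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, rgb, pointmap):
        # rgb, pointmap: (B*V, 3, H, W) for V views per scene
        x = torch.cat([rgb, pointmap], dim=1)   # (B*V, 6, H, W)
        x = self.proj(x)                        # (B*V, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B*V, num_patches, D)

# Usage: four views of a scene at 224x224 resolution.
embed = ColoredPointmapPatchEmbed()
rgb = torch.randn(4, 3, 224, 224)    # RGB images
pts = torch.randn(4, 3, 224, 224)    # per-pixel XYZ pointmaps
tokens = embed(rgb, pts)             # (4, 196, 768) fused patch tokens
```

Fusing before the first Transformer block, rather than running separate image and geometry encoders, lets every attention layer operate on tokens that already combine both modalities, which is consistent with the "unified colored pointmap representation" the caption describes.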