Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception

Haoming Chen, Zhizhong Zhang, Yanyun Qu, Ruixin Zhang, Xin Tan, Yuan Xie

TL;DR

This work introduces CSC, a scene-level semantic consistency framework for universal 3D large-scale perception. By leveraging vision foundation models to generate coherent semantic prototypes and blending cross-modality information through a multi-modality prototype fusion module, CSC enforces cross-scene semantic alignment and reduces self-conflict in 3D pre-training. The approach yields state-of-the-art performance on semantic segmentation, object detection, and panoptic segmentation in annotation-efficient settings on nuScenes, demonstrating the practical impact of scene-level semantics for robust 3D understanding. The combination of VFM-assisted prototype generation and a prototype-based contrastive loss provides a scalable, label-efficient foundation for downstream 3D perception tasks.

Abstract

An effective pre-training framework with universal 3D representations is highly desirable for perceiving large-scale dynamic scenes. However, establishing such a framework that is both task-generic and label-efficient poses a challenge: unifying the representation of the same primitive across diverse scenes. Current contrastive 3D pre-training methods typically follow frame-level consistency, which focuses on the 2D-3D relationships within each isolated image. Such limited consistency greatly hampers progress toward a universal pre-training framework in two ways: (1) cross-scene semantic self-conflict, i.e., the intense collision between primitive segments of the same semantics from different scenes; and (2) the lack of a globally unified bond that pushes cross-scene semantic consistency into 3D representation learning. To address these challenges, we propose the CSC framework, which places scene-level semantic consistency at its heart, bridging similar semantic segments across various scenes. To achieve this goal, we combine the coherent semantic cues provided by the vision foundation model with the knowledge-rich cross-scene prototypes derived from complementary multi-modality information. These allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning effort. Empirically, we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU), object detection (+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D networks on nuScenes. Code is released at https://github.com/chenhaomingbob/CSC in the hope of inspiring future research.

Paper Structure (16 sections, 4 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Brief illustration of the current multi-modality 3D pre-training paradigm and our proposed scene-level consistency. (a) We show the process of superpixel-superpoint association and observe that superpixels with the same semantics can appear in the same scene or in different scenes, e.g., the green squares. (b) We summarize existing pre-training methods and find that they all use frame-level consistency to learn 3D representations. We argue that this constraint breaks the semantic consistency across views/frames, and we visualize the drawback. (c) We show our proposed scene-level consistency, which keeps the semantic consistency across various scenes. As a result, we achieve SOTA on three perception tasks with limited 3D annotation (Tab. \ref{tab:three_ann_eff_tasks}). Our CSC framework builds a strong pre-training baseline for universal 3D large-scale perception.
  • Figure 2: Overview of the CSC framework. CSC leverages scene-level semantic consistency to obtain universal 3D representations (Sec. \ref{sec:methodology}), and then fine-tunes the pre-trained 3D backbone for three downstream perception tasks (Sec. \ref{sec:experiments}). To achieve scene-level semantic consistency, CSC consists of the VFM-assisted semantic prototype generation module (Sec. \ref{sec:2d_3d_prototypes}) and the coherent semantic consistency module (Sec. \ref{sec:prototype-based-Constraint}).
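
The prototype-based contrastive loss mentioned in the TL;DR is not spelled out in this summary, so the following is only a minimal sketch of the general idea, assuming a generic InfoNCE-style formulation rather than the authors' exact loss. Segment features from all scenes are pooled into one prototype per semantic class, and each 3D feature is pulled toward the prototype of its class and pushed away from the others. The function names, the mean-pooling of prototypes, and the temperature of 0.07 are all illustrative assumptions, not taken from the paper or its code release.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def class_prototypes(features, labels, num_classes):
    """Mean feature per semantic class, pooled over segments from all scenes.

    Mean pooling is an assumption made for this sketch.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, cls in zip(features, labels):
        counts[cls] += 1
        for i, x in enumerate(feat):
            sums[cls][i] += x
    return [
        l2_normalize([s / max(n, 1) for s in row])
        for row, n in zip(sums, counts)
    ]


def prototype_contrastive_loss(point_features, labels, prototypes, temperature=0.07):
    """Average cross-entropy over cosine similarities to all prototypes.

    The prototype of the feature's own class is the positive; every other
    prototype acts as a negative (generic InfoNCE form, assumed here).
    """
    total = 0.0
    for feat, cls in zip(point_features, labels):
        feat = l2_normalize(feat)
        logits = [
            sum(a * b for a, b in zip(feat, proto)) / temperature
            for proto in prototypes
        ]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[cls]
    return total / len(point_features)


# Toy data: four 2D segment features from different scenes, two classes.
segment_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
segment_labels = [0, 0, 1, 1]
prototypes = class_prototypes(segment_feats, segment_labels, num_classes=2)

# Two 3D features, scored once with matching and once with swapped classes.
point_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = prototype_contrastive_loss(point_feats, [0, 1], prototypes)
misaligned_loss = prototype_contrastive_loss(point_feats, [1, 0], prototypes)
```

In this toy setup the loss is near zero when each 3D feature points toward the prototype of its own class and large when the classes are swapped. Because the prototypes are shared across scenes, segments with the same semantics from different scenes are never treated as negatives of each other, which is the self-conflict the abstract describes.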