Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?
Yifan Zhang, Junhui Hou
TL;DR
This work tackles the limitations of cross-modal contrastive distillation for 3D representation learning by showing that focusing solely on modality-shared information misses essential modality-specific cues. It introduces CMCR (Cross-Modal Comprehensive Representation Learning), which decouples modality-shared and modality-specific features, employs a multi-modal unified codebook for cross-modal alignment, and uses geometry-enhanced masked image modeling plus occupancy estimation to enrich 3D representations. The framework is supported by theoretical analysis that motivates combining shared information with reconstruction-based signals, and extensive experiments demonstrate superior performance across 3D semantic segmentation, object detection, and panoptic segmentation on diverse datasets, especially in low-label regimes. The approach offers practical benefits for scalable 3D perception and provides a solid foundation for future cross-modal 3D learning research.
Abstract
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features and neglect modality-specific features during pre-training, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR (Cross-Modal Comprehensive Representation Learning), to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. In addition, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.
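
For readers unfamiliar with the baseline objective the abstract refers to, the following is a minimal sketch of a generic InfoNCE-style contrastive loss between paired point and pixel features. It is not the CMCR implementation. The function names (`l2_normalize`, `info_nce`), the temperature of 0.07, and the plain-list feature format are assumptions chosen for illustration. The loss rewards agreement between matched point and pixel features, which is why it captures only what the two modalities share.

```python
import math


def l2_normalize(v):
    # Scale a feature vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(point_feats, pixel_feats, temperature=0.07):
    """Mean InfoNCE loss over N paired (point, pixel) feature vectors.

    point_feats[i] and pixel_feats[i] are treated as a positive pair;
    every other pixel feature in the batch acts as a negative.
    """
    p = [l2_normalize(v) for v in point_feats]
    q = [l2_normalize(v) for v in pixel_feats]
    total = 0.0
    for i, p_i in enumerate(p):
        # Cosine similarity of point i to every pixel feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(p_i, q_j)) / temperature for q_j in q]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matched pixel (index i) as the target.
        total += log_denom - logits[i]
    return total / len(p)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(aligned, aligned))  # small: pairs agree
    print(info_nce(aligned, swapped))  # large: pairs disagree
```

Because the loss depends only on similarity between matched pairs, anything present in one modality but absent from the other contributes nothing to it, which is the gap that auxiliary reconstruction-style tasks such as masked image modeling and occupancy estimation are meant to fill.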
