CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei
TL;DR
CLIP-GS tackles the challenge of learning unified vision-language representations for 3D data by leveraging 3D Gaussian Splatting (3DGS) instead of sparse point clouds. It introduces a GS Tokenizer and a transformer-based 3DGS encoder initialized with point-cloud pretraining, paired with an image voting loss to stabilize gradient optimization, and trains via cross-modal contrastive objectives against EVA-CLIP's text and image encoders. The method generates a scalable triplet corpus of 3DGS, rendered images, and captions (~240K triplets from Objaverse), enabling effective multimodal alignment and strong generalization to retrieval, zero-shot, and few-shot 3D tasks. Results show CLIP-GS surpasses prior point-cloud–based approaches across multimodal retrieval and 3D classification benchmarks, establishing 3DGS-based multimodal learning as a powerful direction with efficient data requirements. Overall, CLIP-GS provides a practical and scalable baseline for 3D multimodal learning that exploits texture-rich 3DGS representations and pre-trained vision-language priors.
Abstract
Recent works in 3D multimodal learning have made remarkable progress. However, 3D multimodal models can typically handle only point clouds. Unlike 3D Gaussian Splatting (3DGS), an emerging 3D representation technique, spatially sparse point clouds cannot depict the texture information of 3D objects, which results in inferior reconstruction capability. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed by transformer layers initialized with weights from point cloud models to produce the 3DGS embeddings. CLIP-GS applies a contrastive loss between the 3DGS embeddings and the visual and text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, which helps CLIP-GS learn unified multimodal representations. Leveraging the well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot classification, and few-shot classification.
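To make the contrastive alignment described above concrete, the following is a minimal sketch in plain Python of a symmetric InfoNCE-style loss between 3DGS embeddings and frozen image and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`info_nce`, `clip_gs_style_loss`), the temperature value of 0.07, the symmetric form, and the equal weighting of the two terms are all hypothetical choices. The image voting loss is omitted because the abstract does not specify its form.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `anchors` is the positive for row i of `targets`.

    The temperature default is an assumption, not a value taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    n = len(a)
    # Cosine-similarity logits scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def clip_gs_style_loss(gs_emb, img_emb, txt_emb, temperature=0.07):
    """Hypothetical combined objective: 3DGS-to-text plus 3DGS-to-image terms.

    Equal weighting of the two terms is an assumption made for illustration.
    """
    return info_nce(gs_emb, txt_emb, temperature) + info_nce(gs_emb, img_emb, temperature)


# Toy batch of three objects with 3-dimensional embeddings.
gs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
img = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
txt = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]

aligned = clip_gs_style_loss(gs, img, txt)
# Rotating the text rows breaks the pairing, so the loss should increase.
misaligned = clip_gs_style_loss(gs, img, txt[1:] + txt[:1])
print(aligned, misaligned)
```

In this toy batch, the correctly paired triplets give a much smaller loss than the batch whose captions are rotated by one position, which is the behavior a contrastive objective is meant to reward. In a real training setup the image and text embeddings would come from the frozen CLIP encoders and only the 3DGS encoder would receive gradients.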
