CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei

TL;DR

CLIP-GS learns unified vision-language representations for 3D data from 3D Gaussian Splatting (3DGS) rather than from sparse point clouds. It introduces a GS Tokenizer and a transformer-based 3DGS encoder initialized from point-cloud pretraining, adds an image voting loss to guide the direction and convergence of gradient optimization, and trains with cross-modal contrastive objectives against EVA-CLIP's text and image encoders. The method also builds a triplet corpus of 3DGS, rendered images, and captions (~240K triplets from Objaverse) for multimodal alignment. CLIP-GS outperforms prior point-cloud-based approaches on multimodal retrieval and on zero-shot and few-shot 3D classification, which makes it a practical baseline for 3D multimodal learning that exploits texture-rich 3DGS representations and pre-trained vision-language priors.

Abstract

Recent works in 3D multimodal learning have made remarkable progress. However, 3D multimodal models can typically handle only point clouds. Compared to the emerging 3D representation technique, 3D Gaussian Splatting (3DGS), the spatially sparse point cloud cannot depict the texture information of 3D objects, resulting in inferior reconstruction capabilities. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed through transformer layers pre-initialized with weights from point cloud models, resulting in the 3DGS embeddings. CLIP-GS applies a contrastive loss between the 3DGS embeddings and the image and text embeddings of CLIP, and we introduce an image voting loss to guide the direction and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, which helps CLIP-GS learn unified multimodal representations. Leveraging the well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot classification, and few-shot classification.
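
The contrastive objective mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a standard symmetric InfoNCE (CLIP-style) loss over a batch of matched 3DGS, image, and text embeddings, and the function names, the temperature value of 0.07, and the equal weighting of the two terms are all hypothetical choices made for illustration. The image voting loss is deliberately left out, because its definition is not given in this summary.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys.

    temperature=0.07 is an assumed default, not a value taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_contrastive(a, b, temperature=0.07):
    """Average the a->b and b->a directions, as in CLIP-style training."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def gs_alignment_loss(gs_emb, img_emb, txt_emb, temperature=0.07):
    """Hypothetical combined loss: 3DGS-image term plus 3DGS-text term.

    Equal weighting is an assumption; the image voting loss is omitted.
    """
    return (symmetric_contrastive(gs_emb, img_emb, temperature)
            + symmetric_contrastive(gs_emb, txt_emb, temperature))
```

With this formulation, a batch in which each 3DGS embedding points in the same direction as its paired image and text embeddings yields a loss near zero, while a batch with mismatched pairs yields a large loss, which is the behavior a contrastive alignment objective is meant to have.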

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: (a) Comparison between point cloud reconstruction and 3D Gaussian Splatting (3DGS) reconstruction. (b) The 3DGS approach outperforms point cloud methods across multiple 3D perception tasks, indicating its superior 3D object representation capabilities. These results suggest that 3D perception based on 3DGS holds significant advantages over point cloud-based methods.
  • Figure 2: Statistics of 3DGS Triplets.
  • Figure 3: Overview of CLIP-GS. FPS and kNN are first used to form Gaussian patches. The GS Tokenizer then produces the serialized Gaussian tokens. Finally, the entire sequence of Gaussian tokens is processed by a series of transformer layers pre-trained on point clouds, yielding the Gaussian features.
  • Figure 4: Details of the GS refinement block.
  • Figure 5: Image / text $\rightarrow$ 3D shape retrieval results. Top: for each text query, we retrieve the most similar or the top-2 most similar 3D shapes. Bottom: we take one or two images as input and retrieve the most similar 3D shape.
  • ...and 1 more figures
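
The patch-forming step named in the Figure 3 caption (FPS followed by kNN) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes that grouping operates on the Gaussian centers (xyz positions) only, that the remaining Gaussian attributes would be gathered afterwards using the returned indices, and that sampling starts from index 0 (implementations often start from a random point). The function names are hypothetical.

```python
import math


def farthest_point_sampling(points, num_centers):
    """Greedy FPS: repeatedly pick the point farthest from all chosen centers."""
    chosen = [0]  # assumption: deterministic start at index 0
    # min_dist[i] holds the distance from point i to its nearest chosen center
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def knn_patches(points, center_ids, k):
    """For each center, return the indices of its k nearest points (itself included)."""
    patches = []
    for c in center_ids:
        order = sorted(range(len(points)),
                       key=lambda i: math.dist(points[i], points[c]))
        patches.append(order[:k])
    return patches


# Toy example: five Gaussian centers forming three spatial clusters.
centers_xyz = [(0, 0, 0), (0.1, 0, 0), (5, 0, 0), (5.1, 0, 0), (0, 5, 0)]
center_ids = farthest_point_sampling(centers_xyz, num_centers=3)
patches = knn_patches(centers_xyz, center_ids, k=2)
```

Each entry of `patches` is a list of indices into the Gaussian set; in a pipeline like the one the caption describes, each such group would then be passed to a tokenizer to produce one token per patch. The brute-force distance loops here are quadratic and only suitable for small inputs.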