ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting

Junbang Liu, Enpei Huang, Dongxing Mao, Hui Zhang, Xinyuan Song, Yongxin Ni

TL;DR

ContrastiveGaussian tackles single-view 3D generation by combining diffusion-model priors with contrastive learning over a Gaussian splatting representation. It introduces a Quantity-Aware Triplet Loss and a super-resolution step to widen perceptual differences between samples, which guides the optimization toward sharper geometry and textures. The method runs in two stages: a 3D Gaussian representation optimized via SDS, followed by mesh texture refinement. It yields high-fidelity results in roughly 80 seconds, with strong LPIPS and CLIP similarity gains over prior methods. The approach aims at practical, fast, and consistent 3D content generation from a single image, with room for further improvement in rear-view geometry and texture details.

Abstract

Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically use score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is often limited by the visual inconsistencies of the diffusion model outputs. In this work, we propose ContrastiveGaussian, which integrates contrastive learning into the generative process. By using a perceptual loss, we effectively differentiate between positive and negative samples, leveraging the visual inconsistencies to improve 3D generation quality. To further enhance sample differentiation and improve contrastive learning, we incorporate a super-resolution model and introduce a Quantity-Aware Triplet Loss to address varying sample distributions during training. Our experiments demonstrate that our approach achieves superior texture fidelity and improved geometric consistency.
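The abstract does not give the formula for the Quantity-Aware Triplet Loss, so the following is only a minimal sketch of the general idea: a triplet-style hinge loss that stays well behaved when the number of positive and negative samples varies from step to step. The function name `count_normalized_triplet_loss`, the `margin` value, the Euclidean distance, and the per-group averaging are all assumptions made for illustration, not the paper's definition. In the paper the comparison is perceptual, so the plain feature vectors here stand in for perceptual embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_normalized_triplet_loss(anchor, positives, negatives, margin=1.0):
    """Hypothetical triplet-style loss over variable-sized sample groups.

    Each group's distances are averaged separately, so the loss scale does
    not depend on how many positives or negatives happen to be available.
    This is an illustrative assumption, not the paper's formulation.
    """
    if not positives or not negatives:
        # Nothing to contrast against: contribute no loss for this step.
        return 0.0
    d_pos = sum(l2_distance(anchor, p) for p in positives) / len(positives)
    d_neg = sum(l2_distance(anchor, n) for n in negatives) / len(negatives)
    # Hinge: penalize only when negatives are not at least `margin`
    # farther from the anchor than positives are.
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
# One positive, three well-separated negatives: the hinge is inactive.
far_loss = count_normalized_triplet_loss(
    anchor, [[0.1, 0.0]], [[3.0, 4.0], [0.0, 5.0], [4.0, 3.0]]
)
# One positive, one nearby negative: the hinge is active.
near_loss = count_normalized_triplet_loss(anchor, [[0.1, 0.0]], [[0.3, 0.4]])
```

With these toy inputs, the well-separated case gives zero loss, while the nearby negative yields a positive penalty of about 0.6 (mean positive distance 0.1, negative distance 0.5, margin 1.0).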

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: ContrastiveGaussian Framework. Our framework has two stages. In Stage 1, the input image is upscaled using a super-resolution model, and the 3D Gaussian representation is then optimized with the SDS loss and the Quantity-Aware Triplet Loss. After obtaining the refined 3D Gaussian representation, we convert it into a textured mesh. In Stage 2, the texture details of the generated mesh are further enhanced by applying an MSE loss.
  • Figure 2: Distortion Artifacts. Distortion artifacts can cause irregularities in both texture and geometry, as illustrated in this example.
  • Figure 3: Qualitative comparison. We compare our method with Zero-1-to-3 [22], One-2-3-45 [23], and DreamGaussian [10]. The results show that our method provides superior visual quality and relatively faster generation speed.
  • Figure 4: Detailed comparison. We examine the finer details of the generated model, then rotate it to the left to check for any distortions.
  • Figure 5: Ablation Study. We ablate the proposed designs in our framework to verify their effectiveness.