VRsketch2Gaussian: 3D VR Sketch Guided 3D Object Generation with Gaussian Splatting

Songen Gu, Haoxuan Song, Binjie Liu, Qian Yu, Sanyi Zhang, Haiyong Jiang, Jin Huang, Feng Tian

TL;DR

VRsketch2Gaussian addresses VR sketch-conditioned 3D generation by combining a native 3D Gaussian Splatting representation with a two-stage Sketch-CLIP alignment that bridges the gap between sparse VR sketch embeddings and rich CLIP embeddings. It introduces VRSS, a large-scale multi-modal dataset pairing VR sketches with text, images, point clouds, and 3DGS to support robust training. The method fuses a Perceiver-based sketch reducer with CLIP text features and uses a diffusion-based 3D generator conditioned on the fused sketch-text embeddings to produce detailed 3D Gaussians, with an Efficient Constrained Densification scheme to manage variable Gaussian counts. Evaluations on VRSS and FVRS demonstrate superior geometry and appearance fidelity, faster inference, and strong cross-modal alignment, suggesting practical impact for VR content creation and multi-modal 3D generation research.

Abstract

We propose VRsketch2Gaussian, the first VR sketch-guided, multi-modal, native 3D object generation framework that incorporates a 3D Gaussian Splatting representation. As part of our work, we introduce VRSS, the first large-scale paired dataset containing VR sketches, text, images, and 3DGS, bridging the gap in multi-modal VR sketch-based generation. Our approach features the following key innovations: 1) Sketch-CLIP feature alignment. We propose a two-stage alignment strategy that bridges the domain gap between sparse VR sketch embeddings and rich CLIP embeddings, facilitating both VR sketch-based retrieval and generation tasks. 2) Fine-grained multi-modal conditioning. We disentangle the 3D generation process by using explicit VR sketches for geometric conditioning and text descriptions for appearance control. To facilitate this, we propose a generalizable VR sketch encoder that effectively aligns different modalities. 3) Efficient and high-fidelity 3D native generation. Our method leverages a 3D-native generation approach that enables fast and texture-rich 3D object synthesis. Experiments conducted on our VRSS dataset demonstrate that our method achieves high-quality, multi-modal VR sketch-based 3D generation. We believe our VRSS dataset and VRsketch2Gaussian method will be beneficial for the 3D generation community.
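
The abstract does not spell out the loss behind the Sketch-CLIP feature alignment, so the following is only a minimal sketch of one common way to align two embedding spaces: a symmetric contrastive (InfoNCE-style) loss over paired sketch and CLIP embeddings. The function names, the temperature value of 0.07, and the toy 2-D embeddings are all assumptions made for illustration, not details taken from the paper, and the sketch does not attempt to reproduce the two-stage schedule.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sketch_emb, clip_emb, temperature=0.07):
    """Hypothetical alignment loss: pair i of sketch_emb matches pair i of clip_emb.

    The temperature is an assumed value, not one reported in the paper.
    """
    sketches = [l2_normalize(v) for v in sketch_emb]
    targets = [l2_normalize(v) for v in clip_emb]
    # Cosine similarity between every sketch and every CLIP embedding.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in targets]
        for s in sketches
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    # Average the sketch-to-CLIP and CLIP-to-sketch directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


# Toy example: correctly paired embeddings versus deliberately mismatched ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(aligned, aligned)
loss_shuffled = symmetric_contrastive_loss(aligned, shuffled)
```

In this toy setting the correctly paired batch yields a much lower loss than the mismatched one. A contrastive objective of this kind rewards exactly that behaviour: matched sketch and CLIP embeddings are pulled together, and mismatched ones are pushed apart.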

Paper Structure

This paper contains 13 sections, 14 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: We present VRsketch2Gaussian, a 3D generation pipeline guided by VR sketches. (1) The user provides a VR sketch and a text prompt. (2) A VR sketch encoder extracts a multi-modal aligned sketch embedding. (3) A 3D-native generation model synthesizes 3D Gaussians guided by the fused features of the VR sketch and text. (4) The final 3D object is represented as 3D Gaussians.
  • Figure 2: Pipeline. Our method consists of two stages: (a) We first train a VR sketch encoder that aligns with the CLIP embedding space using contrastive learning, and (b) we use both sketch and text for multi-modal conditional generation.
  • Figure 3: Our Qualitative Results.
  • Figure 4: Sketch embedding distribution across categories
  • Figure 5: Geometry comparison with LGM (Tang et al., 2024). Our method demonstrates better geometry and shows fewer artifacts.