VRsketch2Gaussian: 3D VR Sketch Guided 3D Object Generation with Gaussian Splatting
Songen Gu, Haoxuan Song, Binjie Liu, Qian Yu, Sanyi Zhang, Haiyong Jiang, Jin Huang, Feng Tian
TL;DR
VRSketch2Gaussian addresses the challenge of VR sketch-conditioned 3D generation by leveraging a native 3D Gaussian Splatting (3DGS) representation and a two-stage Sketch-CLIP alignment that bridges sparse VR sketch embeddings and rich CLIP embeddings. It introduces VRSS, a large-scale multi-modal dataset pairing VR sketches with text, images, point clouds, and 3DGS to support robust training. The method fuses a Perceiver-based sketch reducer with CLIP text features and uses a diffusion-based 3D generator conditioned on the fused sketch-text embeddings to produce detailed 3D Gaussians, with an Efficient Constrained Densification scheme to manage variable Gaussian counts. Evaluations on VRSS and FVRS demonstrate superior geometry and appearance fidelity, faster inference, and strong cross-modal alignment, suggesting practical impact for VR content creation and multi-modal 3D generation research.
Abstract
We propose VRSketch2Gaussian, the first VR sketch-guided, multi-modal, native 3D object generation framework that incorporates a 3D Gaussian Splatting (3DGS) representation. As part of our work, we introduce VRSS, the first large-scale paired dataset containing VR sketches, text, images, and 3DGS, bridging the gap in multi-modal VR sketch-based generation. Our approach features the following key innovations: 1) Sketch-CLIP feature alignment. We propose a two-stage alignment strategy that bridges the domain gap between sparse VR sketch embeddings and rich CLIP embeddings, facilitating both VR sketch-based retrieval and generation tasks. 2) Fine-grained multi-modal conditioning. We disentangle the 3D generation process by using explicit VR sketches for geometric conditioning and text descriptions for appearance control. To facilitate this, we propose a generalizable VR sketch encoder that effectively aligns different modalities. 3) Efficient and high-fidelity 3D-native generation. Our method leverages a 3D-native generation approach that enables fast and texture-rich 3D object synthesis. Experiments conducted on our VRSS dataset demonstrate that our method achieves high-quality, multi-modal VR sketch-based 3D generation. We believe our VRSS dataset and VRSketch2Gaussian method will be beneficial for the 3D generation community.
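
The abstract does not state which objective is used for Sketch-CLIP feature alignment. As a rough illustration only, the sketch below assumes a CLIP-style symmetric contrastive (InfoNCE) loss between paired sketch and CLIP embeddings, plus nearest-neighbor retrieval by cosine similarity. The function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical and are not taken from the paper; the two-stage schedule itself is not modeled here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_logits(sketch_embs, clip_embs, temperature):
    """Pairwise cosine similarities divided by a temperature."""
    s = [l2_normalize(v) for v in sketch_embs]
    c = [l2_normalize(v) for v in clip_embs]
    return [[sum(a * b for a, b in zip(si, cj)) / temperature for cj in c]
            for si in s]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(sketch_embs, clip_embs, temperature=0.07):
    """Assumed alignment objective: average of sketch-to-CLIP and
    CLIP-to-sketch InfoNCE terms over a batch of paired embeddings."""
    logits = cosine_logits(sketch_embs, clip_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

def retrieve(sketch_emb, gallery_embs):
    """Index of the gallery embedding most similar to the sketch embedding."""
    sims = cosine_logits([sketch_emb], gallery_embs, 1.0)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy usage: correctly paired embeddings give a lower loss than mismatched ones.
sketches = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_contrastive_loss(sketches, matched)
loss_mismatched = symmetric_contrastive_loss(sketches, mismatched)
best_index = retrieve([0.9, 0.1], mismatched)
```

Under this assumption, a shared embedding space serves both tasks mentioned in the abstract: retrieval reduces to a similarity lookup, and a generator could consume the aligned sketch embedding as a conditioning signal.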
