
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation

Jiantao Lin, Xin Yang, Meixi Chen, Yingjie Xu, Dongyu Yan, Leyi Wu, Xinli Xu, Lie XU, Shunsi Zhang, Ying-Cong Chen

TL;DR

Kiss3DGen repurposes pretrained 2D diffusion priors to generate complete 3D assets by learning a 3D Bundle Image (four views with normals) and reconstructing textured meshes, transforming 3D generation into a 2D image generation problem. It trains Kiss3DGen-Base with LoRA on a large, curated 3D dataset and augments capabilities with Kiss3DGen-ControlNet to enable 3D enhancement, editing, and image-to-3D tasks using ControlNet modules and tunable hyperparameters. The approach demonstrates strong, data-efficient performance across text-to-3D, text-to-multiview, and image-to-3D tasks, often surpassing state-of-the-art methods while requiring fewer training samples. The framework remains compatible with existing diffusion techniques and diffusion-based editing tools, offering a practical, scalable path for open-domain 3D content creation with broad applicability in AR/VR, gaming, and simulation.

Abstract

Diffusion models have achieved great success in generating 2D images. However, the quality and generalizability of 3D content generation remain limited. State-of-the-art methods often require large-scale 3D assets for training, which are challenging to collect. In this work, we introduce Kiss3DGen (Keep It Simple and Straightforward in 3D Generation), an efficient framework for generating, editing, and enhancing 3D objects by repurposing a well-trained 2D image diffusion model for 3D generation. Specifically, we fine-tune a diffusion model to generate a "3D Bundle Image", a tiled representation composed of multi-view images and their corresponding normal maps. The normal maps are then used to reconstruct a 3D mesh, and the multi-view images provide texture mapping, resulting in a complete 3D model. This simple method effectively transforms the 3D generation problem into a 2D image generation task, maximizing the utilization of knowledge in pretrained diffusion models. Furthermore, we demonstrate that our Kiss3DGen model is compatible with various diffusion model techniques, enabling advanced features such as 3D editing and mesh and texture enhancement. Through extensive experiments, we demonstrate the effectiveness of our approach, showcasing its ability to produce high-quality 3D models efficiently.
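To make the tiled representation concrete, the following is a minimal sketch of how a 3D Bundle Image could be assembled from rendered views. The grid layout (RGB views in the top row, normal maps in the bottom row), the 256x256 view size, and the function name `make_bundle_image` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def make_bundle_image(rgb_views, normal_views):
    """Tile N RGB views (top row) and their N normal maps (bottom row)
    into one image. The row/column layout is an assumed convention."""
    if len(rgb_views) != len(normal_views):
        raise ValueError("need exactly one normal map per RGB view")
    top = np.concatenate(rgb_views, axis=1)        # views side by side
    bottom = np.concatenate(normal_views, axis=1)  # normals side by side
    return np.concatenate([top, bottom], axis=0)   # stack the two rows

# Four hypothetical 256x256 views, each filled with a constant for clarity.
views = [np.full((256, 256, 3), i, dtype=np.uint8) for i in range(4)]
normals = [np.full((256, 256, 3), 128, dtype=np.uint8) for _ in range(4)]

bundle = make_bundle_image(views, normals)
print(bundle.shape)  # (512, 1024, 3)
```

Because the result is an ordinary image array, it can in principle be used as a training target for a 2D image diffusion model, which is the reformulation the abstract describes.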

Paper Structure

This paper contains 14 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: A 3D Harry Potter scene built with Kiss3DGen. Our proposed framework, Kiss3DGen, is a unified 3D generation framework that facilitates various 3D generation tasks, including text-to-3D, image-to-3D, 3D enhancement, editing, and more. Specifically, most of the assets in the figure are generated from text conditions (captioned with abbreviated text prompts) or image conditions (marked by dashed lines), while the main characters (Hermione, Ron, and Potter) are created using a hybrid pipeline that combines image-to-3D and text-guided mesh editing. Please zoom in for details and refer to our main paper for a more detailed introduction.
  • Figure 2: Overview of our text-to-3D training and generation framework. In this work, we curate a high-quality text-3D dataset, then train a LoRA [hu2022lora] layer for text-to-3D-Bundle-Image generation (see the method section) on top of a pretrained text-to-image diffusion transformer model with flow matching. Our framework generates 3D assets from a text condition in two stages: 3D Bundle Image generation (Stage I) and 3D reconstruction (Stage II). In Stage I, we generate a 3D Bundle Image with our Kiss3DGen base model, guided by text prompts. In Stage II, we reconstruct the geometry and texture of the 3D asset via LRM [xu2024instantmesh, hong2024lrmlargereconstructionmodel] or sphere initialization, followed by optimization-based mesh refinement and texture projection, i.e., ISOMER [wu2024unique3d]. Zoom in for details.
  • Figure 3: 3D enhancement and editing with Kiss3DGen. To achieve high-quality image-to-3D generation, we incorporate the existing image-to-3D pipeline [xu2024instantmesh] with our general 3D enhancement pipeline. Please zoom in for details.
  • Figure 4: Qualitative comparisons with MVDream [shi2023MVDream] in text-to-multiview generation. In comparison, our method produces significantly better results in both text-image alignment and geometric coherence.
  • Figure 5: Qualitative comparisons with state-of-the-art methods for text-to-3D generation. Kiss3DGen achieves the highest-quality 3D meshes and generates textures that follow the input prompts more accurately than the other methods.
  • ...and 4 more figures
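
As a rough illustration of the hand-off between Stage I and Stage II described in the Figure 2 caption, the sketch below splits a generated bundle image back into per-view RGB images and normal maps and decodes the normals into unit vectors. The two-row layout, the function names, and the common `(n + 1) / 2` RGB encoding of normals are assumptions; the actual reconstruction step (LRM or sphere initialization followed by ISOMER) is not reproduced here.

```python
import numpy as np

def split_bundle_image(bundle, n_views=4):
    """Split a 2-row bundle image into RGB views (top row) and
    normal maps (bottom row). The layout is an assumed convention."""
    h, w, _ = bundle.shape
    if h % 2 or w % n_views:
        raise ValueError("bundle size is not divisible into a 2 x n_views grid")
    vh, vw = h // 2, w // n_views
    rgb = [bundle[:vh, i * vw:(i + 1) * vw] for i in range(n_views)]
    normals = [bundle[vh:, i * vw:(i + 1) * vw] for i in range(n_views)]
    return rgb, normals

def decode_normals(normal_rgb):
    """Map 8-bit RGB in [0, 255] to unit vectors in [-1, 1]^3,
    assuming the widely used (n + 1) / 2 colour encoding."""
    n = normal_rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(length, 1e-8, None)  # avoid division by zero

# Hypothetical bundle: black RGB views, normals all facing +z.
bundle = np.zeros((512, 1024, 3), dtype=np.uint8)
bundle[256:] = (128, 128, 255)

rgb_views, normal_maps = split_bundle_image(bundle)
unit_normals = decode_normals(normal_maps[0])
print(len(rgb_views), rgb_views[0].shape, unit_normals.shape)
```

In this sketch the decoded normals would feed a mesh reconstruction step and the RGB views would feed texture projection, mirroring the division of roles stated in the abstract.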