FlexiDreamer: Single Image-to-3D Generation with FlexiCubes

Ruowen Zhao, Zhengyi Wang, Yikai Wang, Zihan Zhou, Jun Zhu

TL;DR

FlexiDreamer generates high-fidelity textured meshes from a single image by combining multi-view diffusion outputs with an end-to-end mesh reconstruction framework based on FlexiCubes. The approach introduces a hybrid positional encoding and an orientation-aware texture mapping to mitigate geometric distortions and surface ghosting, complemented by eikonal and smooth regularizations to reduce holes and noise. The method produces textured meshes in approximately 1 minute on a single A100 GPU and outperforms prior single-image-to-3D methods in geometry and texture quality while avoiding post-processing steps like Marching Cubes. This work enables rapid, high-quality 3D content creation from minimal input, with broad implications for 3D content generation pipelines.

Abstract

3D content generation has wide applications in various fields. One of its dominant paradigms is sparse-view reconstruction using multi-view images generated by diffusion models. However, since directly reconstructing triangle meshes from multi-view images is challenging, most methods opt for an implicit representation (such as NeRF) during the sparse-view reconstruction and acquire the target mesh through post-processing extraction. Yet the implicit representation takes extensive time to train, and the post-extraction also leads to undesirable visual artifacts. In this paper, we propose FlexiDreamer, a novel framework that directly reconstructs high-quality meshes from multi-view generated images. We utilize an advanced gradient-based mesh optimization, namely FlexiCubes, for multi-view mesh reconstruction, which enables us to generate 3D meshes in an end-to-end manner. To address the reconstruction artifacts caused by inconsistencies in the generated images, we design a hybrid positional encoding scheme to improve the reconstruction geometry and an orientation-aware texture mapping to mitigate surface ghosting. To further enhance the results, we incorporate eikonal and smooth regularizations to reduce geometric holes and surface noise, respectively. Our approach can generate high-fidelity 3D meshes in the single image-to-3D downstream task in approximately 1 minute, significantly outperforming previous methods.
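
The eikonal regularization mentioned in the abstract is commonly written as a penalty on how far the gradient norm of the signed distance function deviates from 1. The following is a minimal NumPy sketch of that general idea, not the paper's implementation: the function name `eikonal_loss`, the regular sampling grid, and the finite-difference gradient are all assumptions made for illustration (this section does not state how FlexiDreamer evaluates the term).

```python
import numpy as np


def eikonal_loss(sdf: np.ndarray, spacing: float) -> float:
    """Mean squared deviation of the SDF gradient norm from 1 (hypothetical helper)."""
    # Finite-difference gradient along each of the three grid axes.
    gx, gy, gz = np.gradient(sdf, spacing)
    grad_norm = np.sqrt(gx**2 + gy**2 + gz**2)
    return float(np.mean((grad_norm - 1.0) ** 2))


# Sample the exact SDF of a sphere (radius 0.5) on a regular grid.
n = 32
axis = np.linspace(-1.0, 1.0, n)
spacing = float(axis[1] - axis[0])
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sphere_sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

# A true distance field has unit gradient norm, so the penalty is small.
loss_true = eikonal_loss(sphere_sdf, spacing)
# Scaling the field by 2 doubles the gradient norm, so the penalty is large.
loss_scaled = eikonal_loss(2.0 * sphere_sdf, spacing)
```

A field that satisfies the eikonal property yields a penalty near zero, while a distorted field is penalized, which is why such a term tends to keep the learned signed distance function well behaved.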

Paper Structure (32 sections, 4 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: FlexiDreamer for single image-to-3D generation: FlexiDreamer can reconstruct 3D content with detailed geometry and accurate appearance from a single image. We are able to generate a premium textured mesh in approximately one minute.
  • Figure 2: The pipeline of FlexiDreamer. The input image is fed into a multi-view diffusion model to generate multi-view images. Then a reconstruction framework based on FlexiCubes is trained end-to-end to produce a high-quality mesh. The mesh extracted from the signed distance function is iteratively optimized by minimizing the difference between its rendered images and the generated multi-view images.
  • Figure 3: Qualitative comparisons with baselines in terms of the generated textured meshes. They show the superior performance of FlexiDreamer in reconstructing both geometry and texture details from single-view images.
  • Figure 4: Results of using different encoding schemes. Compared to the Fourier and hashgrid encoding schemes, our model with hybrid encoding generates the most accurate geometry and texture.
  • Figure 5: Ablation study on the smooth regularizations and the orientation-aware texture mapping strategy. For smooth regularization, the Laplacian smoothing term smooths the mesh surface at a global scale, while the normal consistency constraint helps reduce high-frequency noise. The orientation-aware texture mapping helps mitigate surface ghosting.
  • ...and 5 more figures
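
The two smooth regularizers named in the Figure 5 caption can be illustrated with generic textbook forms. The sketch below assumes a uniform (umbrella) Laplacian and a cosine-based normal consistency term. Both function names and both formulations are assumptions for illustration, since the captions do not give the exact definitions used in FlexiDreamer.

```python
from collections import defaultdict

import numpy as np


def laplacian_loss(verts: np.ndarray, faces: np.ndarray) -> float:
    """Mean squared offset of each vertex from the centroid of its neighbours."""
    neighbours = defaultdict(set)
    for a, b, c in faces.tolist():
        neighbours[a].update((b, c))
        neighbours[b].update((a, c))
        neighbours[c].update((a, b))
    offsets = [verts[i] - verts[sorted(nbrs)].mean(axis=0)
               for i, nbrs in neighbours.items()]
    return float(np.mean(np.sum(np.square(offsets), axis=1)))


def normal_consistency_loss(verts: np.ndarray, faces: np.ndarray) -> float:
    """Mean of (1 - cosine) between unit normals of faces that share an edge."""
    tri = verts[faces]  # shape (F, 3, 3)
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    edge_to_faces = defaultdict(list)
    for f, (a, b, c) in enumerate(faces.tolist()):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_to_faces[(min(u, v), max(u, v))].append(f)
    pairs = [fs for fs in edge_to_faces.values() if len(fs) == 2]
    if not pairs:
        return 0.0
    i, j = np.array(pairs).T
    cosines = np.sum(normals[i] * normals[j], axis=1)
    return float(np.mean(1.0 - cosines))


# Toy mesh: a unit quad split into two triangles with consistent winding.
faces = np.array([[0, 1, 2], [0, 2, 3]])
flat_verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
# The same quad with one corner lifted, creating a crease along the diagonal.
folded_verts = flat_verts.copy()
folded_verts[3, 2] = 1.0

flat_nc = normal_consistency_loss(flat_verts, faces)
folded_nc = normal_consistency_loss(folded_verts, faces)
flat_lap = laplacian_loss(flat_verts, faces)
folded_lap = laplacian_loss(folded_verts, faces)
```

On this toy mesh the flat quad has zero normal consistency penalty and the creased quad a positive one, while the Laplacian term also grows when the corner is lifted. In an optimization loop, such terms would typically be added to the rendering loss with small weights.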