GECO: Generative Image-to-3D within a SECOnd

Chen Wang, Jiatao Gu, Xiaoxiao Long, Yuan Liu, Lingjie Liu

TL;DR

GECO tackles the two main obstacles in image-to-3D generation, slow inference and poor handling of uncertainty, with a two-stage distillation pipeline. Stage I distills a pretrained multi-view diffusion model into a one-step multi-view generator. Stage II uses renderings from a pretrained reconstruction model, computed from that multi-view output, as pseudo ground truth to fine-tune the full model for cross-view consistency. The result is a feed-forward 3D generator that produces high-quality textured meshes in under a second on a single GPU and outperforms prior fast methods in both texture and geometry quality. This makes sub-second 3D asset creation from a single image practical while still handling the uncertainty of unseen views.

Abstract

Recent years have seen significant advancements in 3D generation. While methods like score distillation achieve impressive results, they often require extensive per-scene optimization, which limits their time efficiency. On the other hand, reconstruction-based approaches are more efficient but tend to compromise quality due to their limited ability to handle uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in existing methods through a two-stage approach. In the first stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency in the multi-view generation. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D mesh generation with an unprecedented level of efficiency. We will make the code and model publicly available.
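The two stages described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes Stage I follows the usual variational score distillation (VSD) form, where the update direction is the weighted difference between a frozen teacher's noise prediction and an auxiliary model's noise prediction on the noised generator output, and it assumes Stage II is a plain mean-squared error between renderings. The function names, the linear stand-in "models", the tensor shape, and the noise-schedule values are all hypothetical.

```python
import numpy as np

def vsd_direction(x0, teacher_eps, aux_eps, alpha_t, sigma_t, weight, rng):
    """Toy Stage I update direction for generated multi-view images x0.

    teacher_eps / aux_eps are stand-ins for the frozen multi-view diffusion
    model and the auxiliary model tracking the generator's distribution.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = alpha_t * x0 + sigma_t * noise  # forward-diffuse the generator output
    # Weighted difference of the two noise predictions (assumed VSD form).
    return weight * (teacher_eps(x_t) - aux_eps(x_t))

def render_distill_loss(student_render, teacher_render):
    """Toy Stage II loss (assumed form): MSE between student and teacher renderings."""
    return float(np.mean((student_render - teacher_render) ** 2))

rng = np.random.default_rng(0)
views = rng.standard_normal((6, 3, 8, 8))  # hypothetical: 6 views, 3 channels, 8x8 pixels

teacher = lambda x: 0.5 * x  # hypothetical linear stand-in for the teacher
aux = lambda x: 0.2 * x      # hypothetical linear stand-in for the auxiliary model

direction = vsd_direction(views, teacher, aux, alpha_t=0.8, sigma_t=0.6, weight=1.0, rng=rng)
same = vsd_direction(views, teacher, teacher, alpha_t=0.8, sigma_t=0.6, weight=1.0, rng=rng)
loss = render_distill_loss(views, views + 0.1)
```

In this sketch the direction vanishes when the auxiliary model matches the teacher, which is the fixed point a score-distillation objective drives toward. In a real pipeline the direction would be backpropagated through the one-step generator rather than inspected directly.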

Paper Structure (22 sections, 3 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: We propose GECO, a framework for feed-forward image-to-3D generation that produces textured meshes in 0.64s on a single L40 GPU. Here we show both the texture and geometry renderings of the generated meshes.
  • Figure 2: Overall pipeline of our feed-forward 3D generator, which achieves image-to-3D mesh generation within one second given a conditional image and noise.
  • Figure 3: The two-stage learning pipeline for GECO. Stage I: the multi-view generator is optimized with the VSD objective [wang2023prolificdreamer] using a pre-trained multi-view diffusion model [shi2023zero123++]. Stage II: the full model is optimized to predict the renderings of the pre-trained reconstruction model [tang2024lgm] under the same image and noise condition.
  • Figure 4: Qualitative comparison against baseline methods. GECO outperforms the baselines, especially on unseen views. For each method, the first and second rows are the texture and geometry renderings, respectively.
  • Figure 5: Comparison of GECO with baselines that use different multi-view image generation methods and then reconstruct 3D meshes. Our one-step multi-view generator produces much better results than Zero123 [liu2023zero] and is comparable to Zero123Plus [shi2023zero123++] with 75-step sampling, leading to better 3D renderings.
  • ...and 4 more figures