Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation

Kendong Liu, Zhiyu Zhu, Hui Liu, Junhui Hou

TL;DR

Acc3D tackles the problem of slow diffusion-based single-image to 3D reconstruction by introducing edge-consistency guided score distillation and disentangled adversarial regularization. The edge-consistency component focuses on stabilizing the endpoint score estimation in high-SNR regions to enable few-step generation, while the adversarial module enriches detail through dual discriminators that separately supervise geometry and texture. Together, these components yield over $20\times$ speedups and improved 3D quality relative to strong baselines such as Era3D and Wonder3D, validated on Objaverse/GSO/DTC data with multiview rendering and NeuS reconstruction. The work provides theoretical insights into why edge-focused consistency improves endpoint estimation and demonstrates practical efficacy with extensive ablations, showing strong potential for real-time single-image to 3D diffusion pipelines.

Abstract

We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inference, we emphasize the critical issue of regularizing the learning of the score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement and reinforce each other to elevate generative performance. Extensive experiments demonstrate that Acc3D not only achieves over a $20\times$ increase in computational efficiency but also yields notable quality improvements compared to state-of-the-art methods.

Paper Structure

This paper contains 20 sections, 17 equations, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: Visual illustration of the generated high-quality multiview images and normal maps from a given single-view image by our Acc3D through fewer than four inference steps. Zoom in for details.
  • Figure 2: Overview of our Acc3D. The training pipeline unfolds in two core components: edge consistency-guided distillation and adversarial training. Each component bolsters the other's advantages—the distillation procedure stabilizes adversarial training, mitigating the risk of mode collapse, while adversarial learning can enhance perceptual richness. Collectively, these elements craft a balanced, refined model that excels in both stability and detail.
  • Figure 3: Illustration of the progressive score matching by our edge consistency distillation, in a data manifold view, where $\mathcal{M}_t$ indicates the data manifold at timestamp $t$, e.g., $\mathcal{M}_T$ and $\mathcal{M}_0$ represent the manifolds of pure noise and clean samples, respectively; $\mathbf{X}_{0|T}$ represents the few-step estimation of the clean sample from Gaussian noise $\mathbf{X}_T$; $\widetilde{\mathbf{X}}_0$ represents a relatively accurate training target of $\mathbf{X}_{0|T}$, refined by the edge consistency region; and $\mathcal{E}(\cdot)$ is the generation error, e.g., the distance to the corresponding manifold surface. (a) shows the single-step reverse trajectory by the endpoint (pure noise) score function; (b) represents the noised latent interpolation using noise $\mathbf{X}_T$ and $\mathbf{X}_{0|T}$; and (c) indicates the score estimation in the region with the consistency characteristic. The left subfigure shows the state before adaptation by the proposed strategy. The right subfigure shows that, after training with our method, the error of the score function at the initial pure-noise state gradually decreases, i.e., $\mathcal{E}(\mathbf{X}_{0|T})>\mathcal{E}(\mathbf{X}_{0|T}')$, indicating that the result $\mathbf{X}_{0|T}$ gradually approaches the data manifold.
  • Figure 4: Visual comparisons of our Acc3D, Era3D, and Wonder3D on the GSO dataset. For each sample, we provide the generated view, normal map, and reconstructed 3D mesh, displayed from left to right, respectively.
  • Figure 5: Qualitative results of Acc3D on images of various styles generated by the text-to-image model Flux. Please refer to Fig. \ref{fig:vsbase} for more comprehensive comparisons between our Acc3D and the baseline model Era3D. Zoom in for details.
  • ...and 11 more figures
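
As a reading aid for panel (b) of Figure 3, the noised latent interpolation between the noise $\mathbf{X}_T$ and the few-step estimate $\mathbf{X}_{0|T}$ can be sketched with the standard variance-preserving forward-diffusion form. This is an assumption for illustration only: the cumulative schedule coefficient $\bar{\alpha}_t$ and the intermediate latent $\mathbf{X}_t$ are not defined in the text above, and the paper's own parameterization may differ.

```latex
% Assumed variance-preserving interpolation (not taken from the paper):
% \bar{\alpha}_t is a hypothetical cumulative noise-schedule coefficient,
% decreasing from about 1 at t = 0 to about 0 at t = T.
\[
  \mathbf{X}_t \;=\; \sqrt{\bar{\alpha}_t}\,\mathbf{X}_{0|T}
  \;+\; \sqrt{1-\bar{\alpha}_t}\,\mathbf{X}_T ,
  \qquad 0 \le t \le T .
\]
```

Under this assumed form, small $t$ gives $\bar{\alpha}_t \approx 1$, so $\mathbf{X}_t$ stays close to $\mathbf{X}_{0|T}$. That is the high signal-to-noise ratio regime the abstract associates with edge consistency.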