AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation

Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang

TL;DR

AlignCVC tackles cross-view inconsistency in single-image-to-3D generation by reframing generation as distribution alignment toward the ground-truth (GT) multi-view distribution. It combines a soft-aligned multi-view generation (MVG) model, trained with score distillation (ASD), with a hard-aligned reconstruction model trained under adversarial supervision, forming a fast 3D-aware sampling loop that can run with as few as $K=4$ diffusion steps. The approach is plug-and-play across MVG and reconstruction models. Experiments on Objaverse-derived data show consistent gains in cross-view consistency (CVC) and standard 3D metrics across various model pairs, along with substantial speedups over prior 3D-aware sampling methods. The auxiliary networks add GPU-memory overhead, but AlignCVC delivers more robust and efficient single-image-to-3D generation with improved cross-view consistency and generalization.

Abstract

Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.

Paper Structure

This paper contains 11 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Our AlignCVC method jointly post-trains multi-view generation and reconstruction models with distribution alignment for 3D-aware sampling, enabling high-fidelity image-to-3D generation with only 4 diffusion steps for efficient inference.
  • Figure 2: Two typical MVG-based Image-to-3D approaches.
  • Figure 3: The impact of CVC in 3D-aware sampling. For Ouroboros3D [wen2024ouroboros3d], VideoMV [zuo2024videomv], and Gen-3Diffusion [xue2024gen], noise and lack of CVC affect 3D reconstructions. VideoMV applies feedback starting at the 20th step, so its results are shown after its first feedback. Our model, integrating Wonder3D [long2024wonder3d] and GeoLRM [zhang2024geolrm], delivers better results with high time efficiency.
  • Figure 4: The framework of AlignCVC. During training, the multi-view generation (MVG) student model generates multi-view images $\hat{\boldsymbol{X}}^{\pi}_{k}$ from input image $\boldsymbol{x}_c$ at camera poses $\pi$. A pre-trained MVG teacher model aligns $\hat{\boldsymbol{X}}^{\pi}_{k}$ with the GT distribution via a soft-alignment method. We then obtain the 3D model from the reconstruction model and adversarially supervise its renderings $\tilde{\boldsymbol{X}}^{\pi}_{k}$ to the GT distribution in a hard-aligned manner. In the inference phase, we reconstruct an intermediate 3D model with the generated multi-view images $\hat{\boldsymbol{X}}^{\pi}_{k}$ at each timestep, where the renderings $\tilde{\boldsymbol{X}}^{\pi}_{k}$ are then re-noised for the next denoising timestep with 3D-aware sampling. This recursive sampling, repeated for 4 steps, produces the final 3D model.
  • Figure 5: Comparison results on image-to-3D generation. Gen3Diff is short for Gen-3Diffusion.
  • ...and 1 more figure
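
The inference procedure described in the Figure 4 caption (generate views, reconstruct, render, re-noise, repeat for 4 steps) can be sketched as a short loop. This is only a toy illustration under stated assumptions, not the authors' implementation: `mvg_denoise`, `reconstruct_and_render`, `renoise`, and the linear noise schedule are all hypothetical stand-ins, and the "3D model" is reduced to a single array shared by every view.

```python
import numpy as np


def mvg_denoise(noisy_views, cond_image, rng):
    """Hypothetical stand-in for the MVG student model.

    Predicts clean views from noisy ones, with an independent error per
    view to mimic weak cross-view consistency in the generated images.
    """
    pred = 0.5 * noisy_views + 0.5 * cond_image  # cond_image broadcasts over views
    return pred + 0.1 * rng.standard_normal(noisy_views.shape)


def reconstruct_and_render(views):
    """Hypothetical stand-in for the reconstruction model plus renderer.

    The toy "3D model" is the mean over views; rendering it at every pose
    returns the same array, so the renderings are consistent by construction.
    """
    shared = views.mean(axis=0, keepdims=True)
    return np.repeat(shared, views.shape[0], axis=0)


def renoise(views, sigma, rng):
    """Toy re-noising: additive Gaussian noise (assumed, not the paper's schedule)."""
    return views + sigma * rng.standard_normal(views.shape)


def sample_3d_aware(cond_image, num_views=4, num_steps=4, seed=0):
    """Toy 3D-aware sampling loop with K = num_steps denoising steps."""
    rng = np.random.default_rng(seed)
    # Assumed linear schedule from pure noise down to zero noise.
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    views = sigmas[0] * rng.standard_normal((num_views,) + cond_image.shape)
    for k in range(num_steps):
        x_hat = mvg_denoise(views, cond_image, rng)   # generated views
        x_tilde = reconstruct_and_render(x_hat)       # consistent renderings
        views = renoise(x_tilde, sigmas[k + 1], rng)  # input to the next step
    return views
```

Calling `sample_3d_aware(np.full((8, 8), 0.5))` returns an array of shape `(4, 8, 8)`. Because the last step re-noises with zero noise, the output equals the final renderings, which are identical across the view axis in this toy setup. In a real system the consistency would come from rendering an actual 3D representation at the camera poses $\pi$.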