AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation

Xinyue Liang; Zhiyuan Ma; Lingchen Sun; Yanjun Guo; Lei Zhang

TL;DR

AlignCVC tackles cross-view inconsistency in single-image-to-3D generation by reframing the task as distribution alignment toward the ground-truth (GT) multi-view distribution. It combines soft alignment of the multi-view generation (MVG) model through the score-distillation objective ASD with hard alignment of the reconstruction model through adversarial supervision, forming a fast 3D-aware sampling loop that can run with as few as $K=4$ diffusion steps. The approach is plug-and-play across MVG and reconstruction models. Experiments on Objaverse-derived data show consistent gains in cross-view consistency (CVC) and standard 3D metrics across various model pairs, along with substantial speedups over prior 3D-aware sampling methods. The auxiliary networks add GPU-memory overhead, but in return AlignCVC delivers more robust and efficient single-image-to-3D generation with improved cross-view consistency and generalization.

Abstract

Single-image-to-3D models typically follow a sequential generation-then-reconstruction workflow. However, the intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), which significantly degrades 3D reconstruction performance. Recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, but they struggle with noisy and unstable reconstruction outputs that limit the achievable CVC improvement. We introduce AlignCVC, a novel framework that reframes single-image-to-3D generation as distribution alignment rather than relying on strict regression losses. Our key insight is to align both the generated and the reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for the generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference, to as few as 4 steps. As a plug-and-play paradigm, AlignCVC seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.
