Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views
Yabo Chen, Jiemin Fang, Yuyang Huang, Taoran Yi, Xiaopeng Zhang, Lingxi Xie, Xinggang Wang, Wenrui Dai, Hongkai Xiong, Qi Tian
TL;DR
This work tackles the problem of generating geometrically and visually consistent 3D views from a single image. It introduces Cascade-Zero123, a two-stage cascade of Zero-1-to-3 models (Base-0123 and Refiner-0123) that generates self-prompted nearby views to progressively reveal 3D structure, with Base-0123 updated through self-distillation using an exponential moving average (EMA) to improve consistency. The approach applies cross-attention over the multi-view prompts and uses score distillation sampling (SDS) based optimization to synthesize the final target view. Experiments on Objaverse, GSO, RTMV, and RealFusion15 demonstrate substantial gains in view consistency and 3D fidelity, especially for complex scenes, indicating strong generalization and practical potential for single-image 3D reconstruction.
Abstract
Synthesizing multi-view 3D content from a single image is a significant but challenging task. Zero-1-to-3 methods have achieved great success by lifting a 2D latent diffusion model to 3D. The target-view image is generated with a single-view source image and the camera pose as condition information. However, because a single input image provides only sparse information, Zero-1-to-3 tends to produce geometry and appearance that are inconsistent across views, especially for complex objects. To tackle this issue, we propose to supply more condition information to the generation model, but in a self-prompted way. We construct a cascade framework of two Zero-1-to-3 models, named Cascade-Zero123, which progressively extracts 3D information from the source image. Specifically, several nearby views are first generated by the first model and then fed into the second-stage model, along with the source image, as generation conditions. With these additional self-prompted condition images, Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. Experimental results demonstrate remarkable improvement, especially for complex and challenging scenes involving insects, humans, transparent objects, and multiple stacked objects. More demos and code are available at https://cascadezero123.github.io.
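The two-stage data flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Condition`, `sample_nearby_poses`, and `cascade_generate`, the (elevation, azimuth, radius) pose layout, the 30-degree sampling range, and the default of four nearby views are all hypothetical. The two models are passed in as plain callables so that the sketch runs without any diffusion library.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]          # placeholder image type (hypothetical)
Pose = Tuple[float, float, float]  # assumed relative (elevation, azimuth, radius)


@dataclass
class Condition:
    image: Image
    pose: Pose  # pose of this condition image relative to the source view


def sample_nearby_poses(n: int, max_angle_deg: float = 30.0, seed: int = 0) -> List[Pose]:
    """Sample n small relative rotations around the source view.

    The sampling range is an assumption chosen for illustration.
    """
    rng = random.Random(seed)
    return [
        (rng.uniform(-max_angle_deg, max_angle_deg),
         rng.uniform(-max_angle_deg, max_angle_deg),
         0.0)
        for _ in range(n)
    ]


def cascade_generate(
    source: Image,
    target_pose: Pose,
    base_model: Callable[[Image, Pose], Image],
    refiner_model: Callable[[List[Condition], Pose], Image],
    num_nearby: int = 4,
) -> Image:
    # Stage 1: the first model produces nearby views from the source alone.
    nearby_poses = sample_nearby_poses(num_nearby)
    prompts = [Condition(base_model(source, p), p) for p in nearby_poses]

    # Stage 2: the second model is conditioned on the source image
    # together with the self-prompted nearby views.
    conditions = [Condition(source, (0.0, 0.0, 0.0))] + prompts
    return refiner_model(conditions, target_pose)
```

In this sketch the second-stage model receives `1 + num_nearby` condition images, which reflects the idea of supplying more condition information without requiring any extra input from the user.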
