Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

Yabo Chen; Jiemin Fang; Yuyang Huang; Taoran Yi; Xiaopeng Zhang; Lingxi Xie; Xinggang Wang; Wenrui Dai; Hongkai Xiong; Qi Tian

TL;DR

This work tackles the problem of generating geometrically and visually consistent 3D views from a single image. It introduces Cascade-Zero123, a two-stage cascade of Zero-1-to-3 models (Base-0123 and Refiner-0123): Base-0123 generates self-prompted nearby views that progressively reveal 3D structure, and Refiner-0123 uses them, together with the source image, as conditions. A self-distillation step updates Base-0123 from Refiner-0123 via an exponential moving average (EMA) to improve consistency. The approach applies cross-attention over the multi-view prompts to synthesize the final target view, and uses SDS-based optimization for single-image 3D reconstruction. Experiments on Objaverse, GSO, RTMV, and RealFusion15 show gains in view consistency and 3D fidelity, especially for complex scenes, indicating strong generalization and practical potential for single-image 3D reconstruction.

Abstract

Synthesizing multi-view 3D from a single image is a significant but challenging task. Zero-1-to-3 methods have achieved great success by lifting a 2D latent diffusion model to the 3D scope. The target view image is generated with a single-view source image and the camera pose as condition information. However, because a single input image is highly sparse, Zero-1-to-3 tends to produce geometric and appearance inconsistencies across views, especially for complex objects. To tackle this issue, we propose to supply more condition information to the generation model, but in a self-prompted way. A cascade framework named Cascade-Zero123 is constructed from two Zero-1-to-3 models and progressively extracts 3D information from the source image. Specifically, several nearby views are first generated by the first model and then fed into the second-stage model, along with the source image, as generation conditions. With these additional self-prompted condition images, Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. Experimental results demonstrate remarkable improvement, especially for complex and challenging scenes including insects, humans, transparent objects, and multiple stacked objects. More demos and code are available at https://cascadezero123.github.io.
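The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `cascade_generate`, `base_model`, `refiner_model`, and the `Pose` tuple layout are all hypothetical names chosen for this sketch, and the two models are passed in as plain callables.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical types for this sketch only.
Pose = Tuple[float, float, float]   # assumed layout: (d_polar, d_azimuth, d_radius)
Image = Any                         # placeholder for whatever image type the models use
Condition = Tuple[Image, Pose]      # an image plus its pose offset from the source


def cascade_generate(
    source: Image,
    target_pose: Pose,
    nearby_poses: Sequence[Pose],
    base_model: Callable[[Image, Pose], Image],
    refiner_model: Callable[[List[Condition], Pose], Image],
) -> Image:
    """Generate a target view from one image via self-prompted nearby views."""
    # Stage 1: the first model renders several nearby views of the source.
    prompts = [(base_model(source, pose), pose) for pose in nearby_poses]
    # The source image stays in the condition set, at zero offset from itself.
    conditions = [(source, (0.0, 0.0, 0.0))] + prompts
    # Stage 2: the second model conditions on the source plus the generated views.
    # (In the paper, each condition's pose is re-expressed relative to the
    # target; that step is left to the refiner callable here.)
    return refiner_model(conditions, target_pose)
```

With stub callables in place of the two diffusion models, the function only demonstrates the ordering: nearby views are produced first, then all conditions are handed to the second stage in one call.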

Paper Structure (29 sections, 13 equations, 7 figures, 8 tables)

Introduction
Related Work
Single Image to 3D
Multi-stage Diffusion Models
Methods
Preliminary
Cascade-Zero123 Framework
Base-0123 Framework
Refiner-0123 Framework
Self-Distillation Design
Inference
Experiments
Datasets
Implementation Details
Metrics
...and 14 more sections

Figures (7)

Figure 1: Whereas the Zero-1-to-3 [liu2023zero1to3] generation pipeline is limited to a single-view source image, Cascade-Zero123 progressively extracts 3D information from additional condition images obtained by self-prompting. View-consistent images can thus be generated in a cascade manner. Cascade-Zero123 shows strong capability on various complex objects, e.g. insects, robots, or multiple stacked objects.
Figure 2: Performance comparison of Zero-1-to-3 [liu2023zero1to3] and our method on Google Scanned Objects [downs2022gso] at different camera pose rotation angles. When the camera pose changes drastically, the synthesis quality of Zero-1-to-3 drops sharply, whereas our method improves synthesis quality across the whole range of camera pose changes.
Figure 3: The architecture of Cascade-Zero123, which can be divided into two parts. The left part is Base-0123, which takes a set of R and T values as input and generates the corresponding multi-view images. These output images are concatenated with the input condition image and its corresponding camera pose, forming a self-prompted input, denoted as a set of $c(x_c,\Delta R, \Delta T)$, for the right part, Refiner-0123. The camera pose transition from each condition image to the target image must be recalculated, as shown in the detailed camera pose rotations. After each training iteration, Base-0123 is updated from Refiner-0123 through an exponential moving average (EMA).
Figure 4: Novel view synthesis compared with Zero123-XL [objaverseXL] and SyncDreamer [liu2023syncdreamer], where Zero123-XL is Zero-1-to-3 pre-trained on the Objaverse-XL dataset [objaverseXL] and achieves higher performance. We selected some challenging scenes, including stacked objects, parallel objects, and objects with multiple branches. Zero123-XL produces good image quality but lacks consistency in these complex scenes. SyncDreamer demonstrates good consistency but struggles to maintain image quality. Our model maintains both quality and consistency in these scenarios.
Figure 5: Single-image 3D reconstruction using the SDS loss [poole2022dreamfusion], compared with Zero123-XL. The first two rows illustrate that Cascade-Zero123 can correct the inaccurate backside colors that Zero-1-to-3 sometimes learns. The middle two rows show how Cascade-Zero123 can rectify structural errors through multi-view self-prompting. The last two rows indicate that Cascade-Zero123 can address the problem of transparent or high-brightness objects being mistakenly learned as white clouds.
...and 2 more figures
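
The Figure 3 caption mentions two mechanical steps: recalculating the camera pose transition from each condition image to the target, and updating Base-0123 from Refiner-0123 by EMA. Both can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's code: poses are assumed to be spherical offsets (polar, azimuth in degrees, radius) relative to the source view, the momentum value is a placeholder, and the names `pose_to_target` and `ema_update` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed pose layout: (polar_deg, azimuth_deg, radius), relative to the source view.
Pose = Tuple[float, float, float]


def pose_to_target(cond_pose: Pose, target_pose: Pose) -> Pose:
    """Offset that carries a condition view to the target view."""
    d_polar = target_pose[0] - cond_pose[0]
    # Wrap the azimuth difference into [-180, 180) so the shorter rotation is used.
    d_azimuth = (target_pose[1] - cond_pose[1] + 180.0) % 360.0 - 180.0
    d_radius = target_pose[2] - cond_pose[2]
    return (d_polar, d_azimuth, d_radius)


def ema_update(
    base: Dict[str, List[float]],
    refiner: Dict[str, List[float]],
    momentum: float = 0.999,  # placeholder value, not taken from the paper
) -> None:
    """In-place EMA: base <- momentum * base + (1 - momentum) * refiner."""
    for name, refiner_values in refiner.items():
        base[name] = [
            momentum * b + (1.0 - momentum) * r
            for b, r in zip(base[name], refiner_values)
        ]
```

For example, a condition view generated 30 degrees in azimuth from the source, paired with a target 90 degrees from the source, would be passed to the refiner with a 60 degree azimuth offset. Parameters are modeled as flat lists of floats to keep the sketch dependency-free; a real training loop would apply the same rule to the models' tensors.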
