Diffusion Time-step Curriculum for One Image to 3D Generation

Xuanyu Yi, Zike Wu, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Hanwang Zhang

TL;DR

This work tackles the ill-posed problem of reconstructing 3D from a single image by redesigning the diffusion time-step strategy used in SDS. It introduces DTC123, a coarse-to-fine diffusion time-step curriculum that orchestrates annealed time-step sampling, progressive student representations (NeRF hash grids and DMTet), and a coarse-to-fine teacher prior (Zero-1-to-3 for geometry, Stable Diffusion for texture) to improve geometry fidelity and texture detail. The method is implemented as a two-stage pipeline that also strengthens reference-view restoration, and it demonstrates superior multi-view consistency and image-to-3D quality across the NeRF4, RealFusion15, GSO, and Level50 benchmarks, including multi-instance generation. The approach is efficient, requiring only thousands of iterations on a single GPU, and provides a plug-and-play principle for leveraging diffusion priors in SDS-based 3D reconstruction.

Abstract

Score distillation sampling (SDS) has been widely adopted to overcome the absence of unseen views in reconstructing 3D objects from a single image. It leverages pre-trained 2D diffusion models as the teacher to guide the reconstruction of student 3D models. Despite their remarkable success, SDS-based methods often encounter geometric artifacts and texture saturation. We find that the crux is the overlooked, indiscriminate treatment of diffusion time-steps during optimization: it unreasonably treats student-teacher knowledge distillation as equal at all time-steps and thus entangles coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion Time-step Curriculum one-image-to-3D pipeline (DTC123), in which both the teacher and student models collaborate with the time-step curriculum in a coarse-to-fine manner. Extensive experiments on the NeRF4, RealFusion15, GSO, and Level50 benchmarks demonstrate that DTC123 can produce multi-view consistent, high-quality, and diverse 3D assets. Code and more generation demos will be released at https://github.com/yxymessi/DTC123.
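
To make the coarse-to-fine idea concrete, the following is a minimal sketch of an annealed time-step sampler, contrasted with the uniform sampling that plain SDS uses. The exact schedule used by DTC123 is not stated in this summary, so the linear decay, the bounds `t_max=980` and `t_min=20`, the `window` width, and the function name `annealed_timestep` are all illustrative assumptions rather than the authors' implementation.

```python
import random


def annealed_timestep(step, total_steps, t_max=980, t_min=20, window=50, rng=random):
    """Sample a diffusion time-step that decreases as training progresses.

    Hypothetical schedule (an assumption, not taken from the paper): the
    window center moves linearly from t_max to t_min, and the time-step is
    drawn uniformly from a small window around that center.
    """
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    center = t_max - frac * (t_max - t_min)
    # Keep the sampling window inside the valid time-step range.
    lo = max(t_min, int(center) - window)
    hi = min(t_max, int(center) + window)
    return rng.randint(lo, hi)


def uniform_timestep(t_max=980, t_min=20, rng=random):
    """Baseline: time-steps drawn uniformly, independent of training progress."""
    return rng.randint(t_min, t_max)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 999):
        print(step, annealed_timestep(step, 1000, rng=rng))
```

Early iterations draw large time-steps, which the paper associates with coarse-grained concepts, and late iterations draw small ones, which it associates with fine-grained details. The random window is one way to keep some variation in the noise level at each stage; a deterministic schedule would be an equally valid reading of "annealed".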

Paper Structure (28 sections, 1 theorem, 11 equations, 13 figures, 3 tables)

This paper contains 28 sections, 1 theorem, 11 equations, 13 figures, 3 tables.

Key Result

Theorem 1

(Diffusion Time-step Lower bound) Assume $p_{t}(\mathbf{x})$ is the noisy data distribution and $q_t(\mathbf{x}_t|\mathbf{x}_{\pi}) = {\mathcal{N}}(\mathbf{x}_t; \alpha_t \mathbf{x}_{\pi}, \sigma_t^2 {\mathbf{I}})$, for any $\mathbf{x}_t \sim q_t(\mathbf{x}_t|\mathbf{x}_{\pi})$, we have $\lVert \bol
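
For reference, the Gaussian kernel $q_t$ in the assumption above can be sampled by the standard reparameterization below. The noise symbol $\boldsymbol{\epsilon}$ is introduced here for illustration and does not appear in the excerpt.

$$\mathbf{x}_t = \alpha_t \mathbf{x}_{\pi} + \sigma_t \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Writing the sample this way shows that $\alpha_t$ scales the rendered image $\mathbf{x}_{\pi}$ while $\sigma_t$ scales the injected noise.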

Figures (13)

  • Figure 1: (a) SDS embraces a symbiotic teacher-student cycle as training iterations progress (Top). However, it entangles coarse-grained and fine-grained modeling through uniform sampling of time steps (Bottom) and equal treatment of student and teacher, where $k_1 \ldots k_3$ denote training iterations from early to late. (b) Our DTC123 follows the diffusion time-step curriculum, where larger time steps capture coarse-grained concepts and smaller time steps focus on fine-grained details.
  • Figure 2: (a) Overall pipeline of DTC123, which has two optimization stages and includes reference view reconstruction and unseen view imagination. (b) Zoom-in diagram of unseen view imagination with the proposed diffusion time-step curriculum.
  • Figure 3: Multi-instance generation by customized prompts.
  • Figure 4: Qualitative comparisons on image-to-3D generation. We present several randomly sampled new views; other views and methods are included in the Appendix. Our DTC123 consistently outperforms other state-of-the-art methods by generating multi-view consistent and high-fidelity results.
  • Figure 5: Ablation study on the component-wise contribution of DTC123. T-S denotes the Teacher-Student collaboration.
  • ...and 8 more figures

Theorems & Definitions (3)

  • Definition 1
  • Theorem 1
  • proof