Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma

TL;DR

Ctrl123 tackles the pose- and appearance-consistency gap in single-image novel view synthesis (NVS) with a closed-loop transcription framework that aligns generated views with ground-truth targets in a pose-sensitive latent space. It extends prior open-loop diffusion models by feeding generated views back through the encoder and enforcing a consistency loss on patch-level features, which improves pose alignment (AA) and silhouette overlap (IoU). Training alternates between closed-loop rounds and standard diffusion fine-tuning. The method demonstrates strong performance gains on 25 training-object subsets and large Objaverse-based datasets, including substantial improvements in 3D reconstruction quality. Overall, Ctrl123 highlights the effectiveness of closed-loop transcription for ensuring cross-view consistency in diffusion-based NVS and points to broader applicability of CTRL-style constraints in 3D-aware generative tasks.
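
To make the patch-level comparison and the silhouette-overlap metric concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. The encoder `toy_patch_encoder`, the squared-error form of the loss, and the toy 4x4 images are all assumptions made for illustration; the summary above only states that generated and ground-truth views are compared on patch-level features in a pose-sensitive space and that IoU measures silhouette overlap.

```python
from typing import Callable, List

Vector = List[float]
Image = List[List[float]]  # toy grayscale image: rows of pixel values


def toy_patch_encoder(image: Image, patch: int = 2) -> List[Vector]:
    """Hypothetical stand-in for a pose-sensitive encoder.

    Returns one flattened vector per non-overlapping patch, so the
    features keep spatial (and hence pose) information.
    """
    feats: List[Vector] = []
    for r in range(0, len(image) - patch + 1, patch):
        for c in range(0, len(image[0]) - patch + 1, patch):
            feats.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return feats


def patch_consistency_loss(
    generated: Image,
    target: Image,
    encoder: Callable[[Image], List[Vector]] = toy_patch_encoder,
) -> float:
    """Mean squared difference between patch features of two views.

    Assumption: squared error is used here only as a simple distance.
    """
    f_gen, f_tgt = encoder(generated), encoder(target)
    if len(f_gen) != len(f_tgt):
        raise ValueError("views must yield the same number of patches")
    total, count = 0.0, 0
    for a, b in zip(f_gen, f_tgt):
        for x, y in zip(a, b):
            total += (x - y) ** 2
            count += 1
    return total / count


def silhouette_iou(mask_a: Image, mask_b: Image) -> float:
    """Intersection over union of two binary silhouettes."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0


# Toy example: the "generated" object is shifted one pixel to the left,
# mimicking a small pose error relative to the ground-truth view.
target = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
generated = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

loss = patch_consistency_loss(generated, target)
iou = silhouette_iou(generated, target)
```

In this toy case the one-pixel shift produces a nonzero patch loss and an IoU below 1, while identical views give a loss of 0 and an IoU of 1. In a closed-loop setup the generated view would come from the diffusion model and the loss would be backpropagated through training; the sketch covers only the comparison step.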

Abstract

Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are consistent with the corresponding ground-truth poses and appearances, even on the training set. This in turn limits the performance of downstream tasks such as image-to-multiview generation and 3D reconstruction. We observe that this inconsistency arises largely because it is difficult to enforce accurate pose and appearance alignment directly during diffusion training, as most existing methods such as Zero123 do. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and the ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on NVS and 3D reconstruction, achieving significant improvements in both multiview consistency and pose consistency over existing methods.

Paper Structure (43 sections, 9 equations, 11 figures, 12 tables)

Figures (11)

  • Figure 1: Ctrl123 generates more pose- and appearance-consistent novel views from a single image of an arbitrary object.
  • Figure 2: A qualitative comparison between the generated novel views and their corresponding ground truth for an object from the training set. Both Zero123-XL [deitke2023objaverse] and Zero123++ [shi2023zero123++] fail to generate results highly consistent with the ground truth in terms of pose and appearance, while Ctrl123 can significantly improve consistency in the generated novel views.
  • Figure 3: Comparison between the training pipeline of current open-loop NVS models and closed-loop Ctrl123.
  • Figure 4: Qualitative comparison of NVS generalization capability on GSO (left) and RTMV (right) after training on a large-scale dataset (100K). More cases can be found in the appendix.
  • Figure 5: NVS/3D examples using Ctrl123 on images from the training set (100K).
  • ...and 6 more figures