NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Yoonwoo Jeong, Jinwoo Lee, Chiheon Kim, Minsu Cho, Doyup Lee

TL;DR

NVS-Adapter provides a plug-and-play approach to novel view synthesis from a single image by freezing the pre-trained text-to-image model and injecting two trainable components: view-consistency cross-attention to align multi-view features and global semantic conditioning to inject reference semantics. By formulating joint view conditioning in the diffusion prior and integrating cross-attention at every U-Net block, it achieves geometrically coherent multi-view outputs without large-scale fine-tuning. The method demonstrates competitive NVS performance on Objaverse and Google Scanned Objects, compatibility with ControlNet and LoRA, and improved 3D reconstruction via SDS, while maintaining training efficiency. These results indicate practical utility for multi-view generation and 3D inference with limited labeled data and without updating billions of base parameters. Overall, NVS-Adapter advances plug-and-play adaptability for reliable 3D-consistent view synthesis from a single image, with strong generalization and compatibility with existing conditioning modules.

Abstract

Transfer learning of large-scale Text-to-Image (T2I) models has recently shown impressive potential for Novel View Synthesis (NVS) of diverse objects from a single image. While previous methods typically train large models on multi-view datasets for NVS, fine-tuning all parameters of a T2I model not only incurs a high cost but also reduces the model's capacity to generalize to diverse images in a new domain. In this study, we propose an effective method, dubbed NVS-Adapter, which is a plug-and-play module for a T2I model that synthesizes novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models. NVS-Adapter consists of two main components: view-consistency cross-attention, which learns visual correspondences to align the local details of view features, and global semantic conditioning, which aligns the semantic structure of generated views with the reference view. Experimental results demonstrate that NVS-Adapter can effectively synthesize geometrically consistent multi-views and achieve high performance on benchmarks without full fine-tuning of T2I models. The code and data are publicly available at https://postech-cvlab.github.io/nvsadapter/.
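To make the view-consistency cross-attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each target view's tokens attend to the tokens of the reference view and of the remaining target views, and that the result is added back as a residual. The function names (`view_cross_attention`, `adapter_block`), the tensor shapes, and the random projection weights are hypothetical. Camera or ray embeddings and the global semantic conditioning branch are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def view_cross_attention(target, context, w_q, w_k, w_v):
    """Scaled dot-product cross-attention.

    target:  (T, d) tokens of the view being refined (queries).
    context: (S, d) tokens gathered from other views (keys and values).
    Returns the attended features (T, d) and the attention weights (T, S).
    """
    q = target @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def adapter_block(views, ref, params):
    """Apply cross-view attention to every target view (hypothetical layout).

    views: (N, T, d) features of N target views.
    ref:   (T, d) features of the reference view.
    """
    n_views = views.shape[0]
    out = np.empty_like(views)
    for i in range(n_views):
        # Assumption: the context is the reference plus all other target views.
        others = [views[j] for j in range(n_views) if j != i]
        context = np.concatenate([ref] + others, axis=0)
        attended, _ = view_cross_attention(views[i], context, *params)
        # Residual connection: base features pass through unchanged and
        # only the adapter contributes the cross-view correction.
        out[i] = views[i] + attended
    return out

rng = np.random.default_rng(0)
d = 8
# Only these projections would be trained; the base T2I weights stay frozen.
params = tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
views = rng.normal(size=(4, 16, d))  # N=4 target views, 16 tokens each
ref = rng.normal(size=(16, d))       # reference view tokens
out = adapter_block(views, ref, params)
```

The residual form is one plausible way to keep a frozen backbone usable on its own: if the adapter's output is zero, the block reduces to the original features.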

Paper Structure (44 sections, 6 equations, 18 figures, 10 tables)

This paper contains 44 sections, 6 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Overview of our framework. NVS-Adapter synthesizes novel views by incorporating two components into each U-Net block of a pre-trained T2I model: view-consistency cross-attention which aligns features of each target view with the relevant features of other views, and global semantic conditioning which aligns the features of target views with the global semantic structure of the reference.
  • Figure 1: Examples of NVS results of our NVS-Adapter with and without ControlNet variants.
  • Figure 2: Novel view synthesis examples by Zero-1-to-3, Zero123-XL, and our NVS-Adapter with $N=1$ and $N=4$. Top images: NVS results conditioned on an image from the Objaverse and GSO validation sets. Bottom images: NVS results conditioned on a single image used in SyncDreamer. The first column shows the reference image, and the remaining four columns are views synthesized by each model. Note that the bottom images have no ground-truth images for the target viewpoints, since they are not from multi-view datasets.
  • Figure 2: Examples of NVS results of our NVS-Adapter with and without LoRA modules.
  • Figure 3: 3D reconstruction examples via Score Distillation Sampling (SDS) with the baselines (Zero-1-to-3 and Zero123-XL) and our NVS-Adapter. Top images show 3D reconstruction results conditioned on an image generated by a T2I model, and bottom images show results on an image in the wild.
  • ...and 13 more figures