NViST: In the Wild New View Synthesis from a Single Image with Transformers

Wonbong Jang, Lourdes Agapito

TL;DR

NViST tackles in-the-wild novel-view synthesis (NVS) from a single image with a transformer-based encoder–decoder trained on real-world MVImgNet data using relative 6-DOF camera poses. The encoder is a fine-tuned MAE that produces geometry-aware features; the decoder uses cross-attention to map those features to a vector-matrix (VM) radiance field, conditioned on camera parameters via adaptive layer normalization and rendered with NeRF-style volume rendering. Results on MVImgNet (including unseen categories and casual phone captures) and ShapeNet-SRN show strong generalization and competitive performance, and ablations highlight the benefits of relative pose, the VM representation, the LPIPS loss, and updating the encoder. By removing the canonicalization requirement and synthesizing full scenes including backgrounds, the work advances practical single-image NVS, with potential impact on AR/VR, robotics, and 3D content creation.

Abstract

We propose NViST, a transformer-based model for efficient and generalizable novel-view synthesis from a single image for real-world scenes. In contrast to many methods that are trained on synthetic data, object-centred scenarios, or in a category-specific manner, NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos of hundreds of object categories with diverse backgrounds. NViST transforms image inputs directly into a radiance field, conditioned on camera parameters via adaptive layer normalisation. In practice, NViST exploits fine-tuned masked autoencoder (MAE) features and translates them to 3D output tokens via cross-attention, while addressing occlusions with self-attention. To move away from object-centred datasets and enable full scene synthesis, NViST adopts a 6-DOF camera pose model and only requires relative pose, dropping the need for canonicalization of the training data, which removes a substantial barrier to it being used on casually captured datasets. We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures. We conduct qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that our model represents a step forward towards enabling true in-the-wild generalizable novel-view synthesis from a single image. Project webpage: https://wbjang.github.io/nvist_webpage.
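
To make the relative-pose idea concrete, here is a minimal sketch, not the authors' implementation. It assumes 4x4 camera-to-world matrices (an assumed convention; the abstract does not specify one) and expresses the target camera in the input camera's frame. The helper names `matmul`, `invert_rigid`, `relative_pose`, and `yaw_pose` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(T):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_pose(c2w_input, c2w_target):
    """Pose of the target camera expressed in the input camera's frame.

    Assumes both arguments are camera-to-world matrices (an assumption).
    """
    return matmul(invert_rigid(c2w_input), c2w_target)


def yaw_pose(angle_deg, tx, ty, tz):
    """Hypothetical helper: camera-to-world pose rotated about the y-axis."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, 0.0, s, tx],
            [0.0, 1.0, 0.0, ty],
            [-s, 0.0, c, tz],
            [0.0, 0.0, 0.0, 1.0]]
```

Under these assumptions, `relative_pose(A, B)` is unchanged if both cameras are first moved by the same global rigid transform `G`, since `inv(G A) (G B) = inv(A) B`. This illustrates why a model that only consumes relative pose does not need scenes aligned to a canonical world frame.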

Paper Structure

This paper contains 19 sections, 7 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: We introduce NViST, a transformer-based architecture that enables synthesis from novel viewpoints given a single in-the-wild input image. We test our model not only on held-out scenes of MVImgNet, a large-scale dataset of casually captured videos of hundreds of object categories (Right), but also on challenging out-of-distribution phone-captured scenes (Left).
  • Figure 2: Architecture. NViST is a feed-forward transformer-based model that takes a single in-the-wild image as input and renders a novel view. The encoder, a fine-tuned Masked Autoencoder (MAE), generates feature tokens, which our novel decoder translates to output tokens via cross-attention, conditioned on normalised focal length and camera distance via adaptive layer normalisation. Self-attention blocks allow reasoning about occlusions. Output tokens are reshaped into a vector-matrix representation that is used for volume rendering. NViST is trained end-to-end with a combination of losses: photometric $L_2$, perceptual $L_{\text{LPIPS}}$, and a distortion-based regulariser $L_{\text{reg}}$.
  • Figure 3: Encoder Output Visualisation: (Top) Input images. (Middle) Features from a fine-tuned MAE, which serve as initialisation for our encoder. (Bottom) Features after end-to-end training. Features are shown after reduction to $3$ dimensions with PCA. The optimised features appear smoother and more segment-focused, consistent with the finding that updating the encoder weights significantly improves performance (see also the ablation in Table \ref{tab:ablation_study}).
  • Figure 4: Qualitative Results on Test (Unseen) Scenes: We show the capabilities of NViST to synthesize novel views of unknown scenes. The model correctly synthesizes images from different viewpoints of various categories with diverse backgrounds and scales.
  • Figure 5: Results on Unseen Category: This figure shows how the model generalises to a novel category unseen at training. We validate our model with a held-out category (toy-cars).
  • ...and 7 more figures
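
The camera conditioning described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the scale-and-shift form of adaptive layer normalisation common in conditioned transformers, and the exact parameterisation used by NViST is not given in this summary. The vector `cond` stands in for the normalised focal length and camera distance; all function and parameter names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise a token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w, b, c):
    """Apply the affine map w @ c + b to the conditioning vector c."""
    return [sum(wi * ci for wi, ci in zip(row, c)) + bi
            for row, bi in zip(w, b)]


def adaptive_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate the normalised token with a scale and shift regressed from cond.

    Assumed form: LN(x) * (1 + scale(cond)) + shift(cond).
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [n * (1.0 + s) + t
            for n, s, t in zip(layer_norm(x), scale, shift)]
```

With all modulation weights set to zero, this reduces to plain layer normalisation; with non-zero weights, the same token is transformed differently for different camera parameters, which is how a single decoder can be made aware of focal length and camera distance.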