Adjustable Visual Appearance for Generalizable Novel View Synthesis

Josef Bengtson, David Nilsson, Che-Tsung Lin, Marcel Büsching, Fredrik Kahl

TL;DR

We address generalizable novel view synthesis with controllable appearance by extending a pretrained Generalizable NeRF Transformer (GNT) with a latent appearance variable $\mathbf{z}_{c'}$ and an appearance-alignment objective. The method renders 3D-consistent novel views of unseen scenes, changes their appearance to match a target weather or lighting condition, and supports smooth interpolation in the appearance space. Key contributions are (i) a rendering pipeline conditioned on the latent appearance variable, (ii) a dedicated appearance loss that aligns renderings to target conditions, and (iii) a synthetic CARLA-based dataset with four appearance conditions for training and evaluation, plus demonstrations on real data (Spaces). Empirically, the approach outperforms 2D style transfer baselines and Instruct-NeRF2NeRF in rendering quality and temporal/multi-view consistency. It enables appearance edits without scene-specific training and with fewer input images, which makes it practical for cross-scene appearance editing in VR/AR pipelines.

Abstract

We present a generalizable novel view synthesis method that modifies the visual appearance of an observed scene so that rendered views match a target weather or lighting condition, without any scene-specific training or access to reference views at the target condition. Our method is based on a pretrained generalizable transformer architecture and is fine-tuned on synthetically generated scenes under different appearance conditions. This allows for rendering novel views in a consistent manner for 3D scenes that were not included in the training set, along with the ability to (i) modify their appearance to match the target condition and (ii) smoothly interpolate between different conditions. Qualitative and quantitative comparisons on real and synthetic scenes show that our method generates 3D-consistent renderings while making realistic appearance changes. Please refer to our project page for video results: https://ava-nvs.github.io/

Paper Structure (24 sections, 3 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Given multiple views of a scene in one weather and lighting condition, we want to generate novel views of the given scene with adjusted visual appearance corresponding to a target condition, without scene-specific optimization.
  • Figure 2: Overview of our method for changing the visual appearance of synthesized novel views. A target view direction is chosen, camera rays $\mathbf{r}$ are cast, and the corresponding source views $\mathbf{I}_{s,c}$ are used to generate a scene representation. A latent appearance variable $\mathbf{z}_{c'}$ is included with the goal of adapting the appearance of the rendered image to match the target view. If the target view is at a different weather or daylight condition ($c \neq c'$), this means adapting the visual appearance to match that of the target view $\mathbf{I}_{t,c'}$ instead of that of the source views $\mathbf{I}_{s,c}$.
  • Figure 3: Appearance change from the day condition into three other conditions. We observe that our method is able to take images at one condition and generate new views of that scene at the three other conditions by changing the overall visual appearance of the images to match the desired condition and by making local changes such as turning on street lamps.
  • Figure 4: Comparison of our method with Instruct-NeRF2NeRF [in2n] and with different 2D style transfer methods applied to rendered images. We note that Instruct-Pix2Pix [brooks2022instructpix2pix] effectively generates realistic 2D edits; however, it exhibits significant inconsistencies that Instruct-NeRF2NeRF fails to consolidate in 3D, leading to an unrealistic appearance. Only our method, Pix2Pix-HD [wang2018high], and Palette [saharia2022palette] learn to turn on the street lamps. SANet [park2019arbitrary] and CyEDA [beh2022cyeda] achieve better structure preservation with some noticeable artifacts. The diffusion models DiffuseIT [kwon2022diffusion] and Instruct-Pix2Pix [brooks2022instructpix2pix] can provide visually plausible results for individual images, but they hallucinate content that does not exist in the original images, leading to multi-view inconsistencies. Palette provides more realistic images but lacks temporal consistency. Comparisons for additional conditions can be found in Appendix \ref{sec:MoreScenarios}.
  • Figure 5: Gradually changing visual appearance by interpolating between latent appearance variables corresponding to day and night. The first row corresponds to latent variables generated with a given structure, enforcing that the evening condition lies between day and night in the latent space, and the second row corresponds to a learned latent variable with no enforced structure. Given images at one appearance condition, our method is able to smoothly transition the appearance to match a different weather and lighting condition, generating plausible intermediate visual appearances. Additional results for interpolation can be seen in the video on the project page: https://ava-nvs.github.io
  • ...and 9 more figures
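
The interpolation described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that intermediate appearances come from a plain linear blend between two latent appearance vectors. The captions do not state the interpolation scheme or the latent dimensionality, so the function `interpolate_latents`, the two-dimensional vectors `Z_DAY` and `Z_NIGHT`, and the number of steps are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def interpolate_latents(
    z_start: Sequence[float], z_end: Sequence[float], num_steps: int
) -> List[List[float]]:
    """Return num_steps latent vectors blended linearly from z_start to z_end.

    Both endpoints are included. Linear blending is an assumption of this
    sketch; the source only says that latents are interpolated.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2 to include both endpoints")

    path = []
    for i in range(num_steps):
        alpha = i / (num_steps - 1)  # 0.0 at z_start, 1.0 at z_end
        path.append([(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical 2-D latents standing in for the day and night conditions.
Z_DAY = [1.0, 0.0]
Z_NIGHT = [0.0, 1.0]

path = interpolate_latents(Z_DAY, Z_NIGHT, num_steps=5)
for step, z in enumerate(path):
    print(step, z)
```

In a full pipeline, each vector in `path` would take the place of $\mathbf{z}_{c'}$ when rendering the same target view, giving one frame per interpolation step.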