
GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer

Youngho Yoon, Hyun-Kurl Jang, Kuk-Jin Yoon

TL;DR

This work tackles the challenge of reproducing high-frequency textures in generalizable neural rendering without scene-specific optimization. It introduces GMT, a plug-and-play module that combines RayDCN, for geometry-aware cross-view feature alignment, with TPFormer, for texture-preserving multi-reference fusion. GMT builds on the G-NeRF framework, using the alpha point cloud $X^{alpha}$ and the correlation values $Corr$. The key contributions are RayDCN, TPFormer, and a demonstration that integrating them consistently improves multiple G-NeRF baselines across diverse datasets with favorable efficiency. The approach enables direct inter-ray interaction in the enhancement stage, yielding sharper textures and fewer artifacts, with practical implications for real-world NVS tasks.

Abstract

Novel view synthesis (NVS) aims to generate images at arbitrary viewpoints using multi-view images, and recent insights from neural radiance fields (NeRF) have contributed to remarkable improvements. Recently, studies on generalizable NeRF (G-NeRF) have addressed the challenge of per-scene optimization in NeRFs. The construction of radiance fields on-the-fly in G-NeRF simplifies the NVS process, making it well-suited for real-world applications. Meanwhile, G-NeRF still struggles in representing fine details for a specific scene due to the absence of per-scene optimization, even with texture-rich multi-view source inputs. As a remedy, we propose a Geometry-driven Multi-reference Texture transfer network (GMT) available as a plug-and-play module designed for G-NeRF. Specifically, we propose ray-imposed deformable convolution (RayDCN), which aligns input and reference features reflecting scene geometry. Additionally, the proposed texture preserving transformer (TP-Former) aggregates multi-view source features while preserving texture information. Consequently, our module enables direct interaction between adjacent pixels during the image enhancement process, which is deficient in G-NeRF models with an independent rendering process per pixel. This addresses constraints that hinder the ability to capture high-frequency details. Experiments show that our plug-and-play module consistently improves G-NeRF models on various benchmark datasets.
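The deformable-sampling idea behind RayDCN can be illustrated with a small sketch. This is not the authors' implementation: it only shows how a 3x3 kernel can read a feature map at positions shifted by per-tap offsets, using bilinear interpolation. The function names `bilinear` and `deformable_sample` are hypothetical, and the offsets are simply passed in by the caller; in GMT they would be derived from scene geometry, which the abstract does not specify in detail.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp the query to the valid range so border samples stay defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1.0 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1.0 - wx) + feat[y1][x1] * wx
    return top * (1.0 - wy) + bottom * wy


def deformable_sample(feat, center, kernel, offsets):
    """Apply a 3x3 kernel whose nine taps are shifted by per-tap offsets.

    kernel:  9 weights in row-major order.
    offsets: 9 (dy, dx) pairs in the same order (assumed to be given;
             a geometry-driven module would predict or compute them).
    """
    cy, cx = center
    out = 0.0
    k = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            oy, ox = offsets[k]
            out += kernel[k] * bilinear(feat, cy + dy + oy, cx + dx + ox)
            k += 1
    return out


if __name__ == "__main__":
    feat = [[4.0 * r + c for c in range(4)] for r in range(4)]
    mean_kernel = [1.0 / 9.0] * 9
    print(deformable_sample(feat, (1, 1), mean_kernel, [(0.0, 0.0)] * 9))
    print(deformable_sample(feat, (1, 1), mean_kernel, [(0.5, 0.5)] * 9))
```

With all offsets at zero this reduces to an ordinary 3x3 convolution tap; non-zero offsets let each tap read from wherever the aligned content is assumed to lie in the reference feature map.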

Paper Structure

This paper contains 16 sections, 10 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: The proposed Geometry-driven Multi-reference Texture transfer (GMT) model.
  • Figure 2: Overall framework of the Geometry-driven Multi-reference Texture transfer (GMT) model. When a generalizable NeRF (G-NeRF) renders a novel-view image $I^{ren}$ from $N$ source images $\{I^{src}_i\}_{i=1}^N$ and a target camera pose $P^{tar}$, the volume rendering process inherently generates an alpha point cloud $X^{alpha}$. Using $\alpha^{refine}$ extracted from $\alpha$ and the correlation values $Corr$, RayDCN aligns features while accounting for scene geometry. TPFormer then aggregates the multi-reference features, and the model generates the final output $I^{tar}$.
  • Figure 3: Ray-imposed Deformable Convolution (RayDCN). It has a deformed kernel shape considering scene geometry and aggregates the source features of multiple rays.
  • Figure 4: Texture-Preserving Transformer (TPFormer). TPFormer aggregates features from multiple source views while preserving textures from the source image.
  • Figure 5: Qualitative comparisons of generalizable NeRF models on DTU, Real Forward-Facing, and Synthetic NeRF datasets.
  • ...and 3 more figures
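
The volume-rendering step referred to in the Figure 2 caption can be sketched with the standard NeRF compositing rule, where each sample along a ray gets an alpha value $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ and a weight $w_i = T_i \alpha_i$ with transmittance $T_i = \prod_{j<i}(1 - \alpha_j)$. The helper `peak_sample_depth` is a hypothetical illustration of how one 3D point per ray could be picked from these weights; the exact way GMT builds $X^{alpha}$ is not described in this summary.

```python
import math


def alpha_weights(sigmas, deltas):
    """Per-sample alpha values and compositing weights along one ray.

    sigmas: densities at the samples; deltas: distances between samples.
    """
    alphas = [1.0 - math.exp(-s * d) for s, d in zip(sigmas, deltas)]
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for a in alphas:
        weights.append(transmittance * a)
        transmittance *= 1.0 - a
    return alphas, weights


def peak_sample_depth(weights, depths):
    """Hypothetical helper: depth of the sample with the largest weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return depths[best]


if __name__ == "__main__":
    sigmas = [0.0, 2.0, 50.0]
    deltas = [0.1, 0.1, 0.1]
    depths = [1.0, 1.1, 1.2]
    alphas, weights = alpha_weights(sigmas, deltas)
    print(alphas, weights, peak_sample_depth(weights, depths))
```

Because every ray is composited independently in this step, neighbouring pixels do not exchange information, which is the limitation the enhancement stage is meant to address.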