CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs

Yingji Zhong, Lanqing Hong, Zhenguo Li, Dan Xu

TL;DR

This paper proposes an approach that models 3D spatial field consistency to improve NeRF's performance with sparse inputs. It adopts a voxel-based ray sampling strategy and applies a contrastive loss to the encoder output of a Transformer to further improve consistency within each voxel.

Abstract

Neural Radiance Fields (NeRF) have shown impressive capabilities for photorealistic novel view synthesis when trained on dense inputs. However, when trained on sparse inputs, NeRF typically encounters issues of incorrect density or color predictions, mainly due to insufficient coverage of the scene causing partial and sparse supervision, thus leading to significant performance degradation. While existing works mainly consider ray-level consistency to construct 2D learning regularization based on rendered color, depth, or semantics on image planes, in this paper we propose a novel approach that models 3D spatial field consistency to improve NeRF's performance with sparse inputs. Specifically, we first adopt a voxel-based ray sampling strategy to ensure that the sampled rays intersect with a certain voxel in 3D space. We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray, which are then incorporated into the volume rendering. By backpropagating through the rendering loss, we enhance the consistency among neighboring points. Additionally, we propose to use a contrastive loss on the encoder output of the Transformer to further improve consistency within each voxel. Experiments demonstrate that our method yields significant improvement over different radiance fields in the sparse inputs setting, and achieves comparable performance with current works.
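
The voxel-based ray sampling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes axis-aligned voxels, uses the standard slab test for ray-box intersection, and draws the additional in-voxel points uniformly. The function names `rays_hitting_voxel` and `sample_points_in_voxel`, and all parameters, are hypothetical.

```python
import numpy as np

def rays_hitting_voxel(origins, dirs, vmin, vmax, eps=1e-9):
    """Boolean mask of rays that pass through the axis-aligned box [vmin, vmax].

    Assumption: voxels are axis-aligned; this is the generic slab test.
    """
    # Replace near-zero direction components so the division stays finite.
    safe = np.where(np.abs(dirs) < eps, eps, dirs)
    t0 = (vmin - origins) / safe
    t1 = (vmax - origins) / safe
    # Latest entry and earliest exit over the three slabs.
    t_near = np.minimum(t0, t1).max(axis=1)
    t_far = np.maximum(t0, t1).min(axis=1)
    # Hit if the entry precedes the exit and the box is not behind the origin.
    return (t_near <= t_far) & (t_far >= 0.0)

def sample_points_in_voxel(vmin, vmax, num_points, seed=0):
    """Draw extra 3D points uniformly inside the voxel (assumed distribution)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(vmin, vmax, size=(num_points, 3))

# Unit voxel and three rays pointing along +z.
vmin, vmax = np.zeros(3), np.ones(3)
origins = np.array([[0.5, 0.5, -1.0],   # passes through the voxel
                    [5.0, 5.0, -1.0],   # misses to the side
                    [0.5, 0.5, 2.0]])   # voxel lies behind the origin
dirs = np.array([[0.0, 0.0, 1.0]] * 3)
mask = rays_hitting_voxel(origins, dirs, vmin, vmax)
extra_points = sample_points_in_voxel(vmin, vmax, num_points=8)
```

Only the rays selected by `mask` would be kept for a given voxel, and `extra_points` stands in for the surrounding points that the Transformer attends to.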

Paper Structure (20 sections, 10 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Qualitative illustration of learned 3D radiance fields and rendered images for the proposed CVT-xRF (Contrastive In-Voxel Transformer) applied to three baselines trained on sparse inputs of three views. CVT-xRF significantly improves all the baselines. The baseline radiance fields show different levels of 3D inconsistency (marked in red boxes for BARF and SPARF), which result in failures or artifacts in rendered images. With CVT, we obtain radiance fields of better 3D consistency and render images of much higher quality.
  • Figure 2: Illustration of the proposed CVT-xRF (Contrastive In-Voxel Transformer) for learning radiance fields from sparse inputs. It consists of three parts: a voxel-based ray sampling strategy, a local implicit constraint module, and a global explicit constraint module. For simplicity, two voxels are shown, along with two rays for each. The local implicit constraint is implemented by a light-weight In-Voxel Transformer, which infers the colors and densities of ray points by interacting with surrounding 3D points. The ray points are then inserted among the points from the importance sampler for rendering. The global explicit constraint is imposed by a voxel contrastive regularization, which encourages the radiance properties of points within a voxel to be more similar than those of points across voxels.
  • Figure 3: The proposed local implicit constraint with a light-weight In-Voxel Transformer for 3D field consistency learning. The colors and densities of ray points $\{\hat{\textbf{x}}_i\}_{i=1}^P$ are predicted by surrounding points $\{\tilde{\textbf{x}}_i\}_{i=1}^S$ through the Transformer. The predicted colors and densities are inserted into the ray for rendering.
  • Figure 4: Visualization of learned radiance fields (in 3D) and the corresponding rendering results of two baselines, i.e., BARF [lin2021barf] and SPARF [truong2023sparf], with/without the proposed CVT.
  • Figure 5: The proposed global explicit constraint for field consistency learning is implemented by a contrastive loss. For each anchor, we select its positive pair from the same voxel and its negative pairs from other voxels.
  • ...and 10 more figures
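
The voxel contrastive regularization in the Figure 2 and Figure 5 captions can be illustrated with a generic InfoNCE-style loss. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes cosine similarity, a temperature of 0.1, and that every other point in the same voxel serves as a positive for an anchor. The name `voxel_contrastive_loss` and its parameters are hypothetical.

```python
import numpy as np

def voxel_contrastive_loss(feats, voxel_ids, temperature=0.1):
    """InfoNCE-style loss over per-point features.

    Positives: other points in the anchor's voxel.
    Negatives: points in any other voxel.
    Assumptions: cosine similarity and the temperature value.
    """
    # L2-normalize so the dot product is cosine similarity.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    n = len(f)
    same = voxel_ids[:, None] == voxel_ids[None, :]
    pos = same & ~np.eye(n, dtype=bool)   # same voxel, excluding the anchor itself
    neg = ~same                           # different voxel
    losses = []
    for i in range(n):
        neg_exp = np.exp(sim[i, neg[i]]).sum()
        for j in np.flatnonzero(pos[i]):
            pos_exp = np.exp(sim[i, j])
            losses.append(-np.log(pos_exp / (pos_exp + neg_exp)))
    return float(np.mean(losses))

# Four point features that form two tight clusters.
feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
# Voxel assignment that matches the clusters vs. one that splits them.
loss_consistent = voxel_contrastive_loss(feats, np.array([0, 0, 1, 1]))
loss_inconsistent = voxel_contrastive_loss(feats, np.array([0, 1, 0, 1]))
```

When the features of points sharing a voxel are already similar, the loss is close to zero; when same-voxel points have dissimilar features, the loss is large, which is the pressure that pulls in-voxel features together.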