Mask-Based Modeling for Neural Radiance Fields

Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu

TL;DR

This work proposes masked ray and view modeling for generalizable NeRF (MRVM-NeRF), a self-supervised pretraining objective that predicts complete scene representations from partially masked features along each ray. This strengthens the model's ability to capture intricate scene details and improves generalization across different scenes.

Abstract

Most Neural Radiance Fields (NeRFs) exhibit limited generalization capability, which restricts their applicability to representing multiple scenes with a single model. To address this problem, existing generalizable NeRF methods simply condition the model on image features. These methods still struggle to learn precise global representations over diverse scenes because they lack an effective mechanism for modeling interactions among different points and views. In this work, we show that 3D implicit representation learning can be significantly improved by mask-based modeling. Specifically, we propose masked ray and view modeling for generalizable NeRF (MRVM-NeRF), a self-supervised pretraining objective that predicts complete scene representations from partially masked features along each ray. With this objective, MRVM-NeRF makes better use of correlations across different points and views as geometry priors, which strengthens its ability to capture intricate scene details and improves generalization across different scenes. Extensive experiments demonstrate the effectiveness of MRVM-NeRF on both synthetic and real-world datasets, qualitatively and quantitatively. We also conduct experiments showing that the method is compatible with various backbones and remains superior in few-shot settings.
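
The masking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_ray_tokens`, the `mask_ratio` parameter, and the use of a fixed `mask_token` vector (which would plausibly be a learnable embedding in practice) are all assumptions made for illustration. It only shows the idea of randomly replacing a fraction of the per-point feature tokens along one ray.

```python
import random


def mask_ray_tokens(tokens, mask_ratio, mask_token, rng):
    """Randomly replace a fraction of the feature tokens along one ray.

    tokens     : list of feature vectors, one per point sampled on the ray
    mask_ratio : fraction of tokens to mask (hypothetical hyperparameter)
    mask_token : vector substituted at masked positions (assumed here to be
                 a fixed placeholder; likely learnable in a real model)
    rng        : random.Random instance, passed in for reproducibility

    Returns (masked_tokens, flags), where flags[i] is True if token i
    was masked.
    """
    n = len(tokens)
    num_masked = int(round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), num_masked))
    masked_tokens = [
        list(mask_token) if i in masked_idx else list(tok)
        for i, tok in enumerate(tokens)
    ]
    flags = [i in masked_idx for i in range(n)]
    return masked_tokens, flags


if __name__ == "__main__":
    rng = random.Random(0)
    # Eight sampled points on a ray, each with a 4-dimensional feature.
    ray_tokens = [[float(i)] * 4 for i in range(8)]
    masked, flags = mask_ray_tokens(ray_tokens, 0.5, [-1.0] * 4, rng)
    print(sum(flags), "of", len(flags), "tokens masked")
```

The network would then be asked to recover the unmasked representation from `masked_tokens`; how positions are chosen and what ratio is used are design choices the abstract does not specify.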

Paper Structure

This paper contains 30 sections, 10 equations, 15 figures, and 9 tables.

Figures (15)

  • Figure 1: Overview of our proposed MRVM-NeRF. To render an image from a target view, rays are cast into 3D space, and a series of points are sampled along each ray. These points are projected onto reference image planes to obtain pixel-aligned image features. We employ a coarse-to-fine sampling strategy and mask a portion of feature tokens input into the fine branch. The coarse and fine branches function as the target and online networks, respectively. Our mask-based pretraining objective $\mathcal{L}_{mrvm}$ aims to predict the corresponding latent representations of the target branch from the online ones within the latent space.
  • Figure 2: Illustration of the masking operation. The striped rectangles denote masked features, which are randomly selected along the ray. The solid circles represent points sampled at the coarse stage, and the hollow ones correspond to extra points sampled at the fine stage. The rectangles with solid boxes denote the global view-invariant features produced by the coarse and fine stages; our MRVM task aims to align them in the same feature space.
  • Figure 3: Visualizations of the ShapeNet-all (rows 1-2), ShapeNet-unseen (row 3), ShapeNet-chair (row 4) and ShapeNet-car (row 5) settings. Our MRVM helps render novel views with more plausible structures, finer details and fewer artifacts.
  • Figure 4: Visualizations on NeRF Synthetic (first row), LLFF (middle row) and DTU (last row) datasets. Masked ray and view modeling aids in rendering images with enhanced texture details, reduced blurring and fewer artifacts.
  • Figure 5: Visualizations for cross-scene generalization on NeRF Synthetic (first row), LLFF (middle row) and DTU (last row) datasets.
  • ...and 10 more figures
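
The Figure 1 caption describes an objective that predicts the target (coarse) branch's latent representations from the online (fine) branch's. A rough sketch of such a latent alignment loss follows. The exact form of $\mathcal{L}_{mrvm}$ is not given in this excerpt, so the cosine-distance formulation below is an assumption, as are the names `cosine_similarity` and `latent_alignment_loss`. Gradient handling (for example, stopping gradients through the target branch) is not modeled here.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def latent_alignment_loss(online_latents, target_latents):
    """Mean cosine distance between paired online and target latents.

    online_latents : per-ray latent vectors predicted by the online branch
    target_latents : per-ray latent vectors from the target branch
    The loss is 0 when each pair points in the same direction and grows
    up to 2 when they point in opposite directions. This choice of
    distance is an assumption, not taken from the paper.
    """
    if len(online_latents) != len(target_latents):
        raise ValueError("online and target must contain the same number of rays")
    distances = [
        1.0 - cosine_similarity(o, t)
        for o, t in zip(online_latents, target_latents)
    ]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    online = [[1.0, 0.0], [0.0, 2.0]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    # Vectors in each pair are parallel, so the loss is close to zero.
    print(latent_alignment_loss(online, target))
```

A scale-invariant distance like this only constrains the direction of the latents; a squared-error loss would be an equally plausible alternative under the caption's description.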