Synergistic Integration of Coordinate Network and Tensorial Feature for Improving Neural Radiance Fields from Sparse Inputs

Mingyu Kim, Jun-Seong Kim, Se-Young Yun, Jin-Hwa Kim

TL;DR

This work tackles the challenge of reconstructing neural radiance fields from sparse inputs by coupling coordinate-based networks, which provide global, low-frequency context, with tensorial multi-plane representations that capture high-frequency details. The method uses residual connections to fuse these heterogeneous features and introduces a curriculum weighting strategy to disentangle global and local information, along with a Laplacian denoising loss to stabilize training. Empirical results show strong gains over state-of-the-art baselines for both static and dynamic NeRFs under sparse views, with competitive performance using fewer parameters and improved training stability. The approach offers a robust, efficient pathway for high-quality novel-view synthesis in data-scarce scenarios, with potential impact on real-world 3D reconstruction and dynamic scene understanding.

Abstract

The multi-plane representation has been highlighted for its fast training and inference across static and dynamic neural radiance fields. This approach constructs relevant features via projection onto learnable grids and interpolating adjacent vertices. However, it has limitations in capturing low-frequency details and tends to overuse parameters for low-frequency features due to its bias toward fine details, despite its multi-resolution concept. This phenomenon leads to instability and inefficiency when training poses are sparse. In this work, we propose a method that synergistically integrates multi-plane representation with a coordinate-based MLP network known for strong bias toward low-frequency signals. The coordinate-based network is responsible for capturing low-frequency details, while the multi-plane representation focuses on capturing fine-grained details. We demonstrate that using residual connections between them seamlessly preserves their own inherent properties. Additionally, the proposed progressive training scheme accelerates the disentanglement of these two features. We demonstrate empirically that our proposed method not only outperforms baseline models for both static and dynamic NeRFs with sparse inputs, but also achieves comparable results with fewer parameters.
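
The feature construction described above (projection onto learnable grids, then interpolation of adjacent vertices) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `bilinear_lookup` and `multi_plane_features`, the three axis-aligned planes keyed `"xy"`, `"xz"`, `"yz"`, the normalized coordinate range [0, 1], and the choice to concatenate the per-plane features are all assumptions made for illustration. The fusion with the coordinate-based MLP is not shown.

```python
import numpy as np

def bilinear_lookup(plane, u, v):
    """Interpolate a feature plane of shape (H, W, C) at normalized coords (u, v) in [0, 1].

    Hypothetical helper: assumes H >= 2 and W >= 2.
    """
    h, w, _ = plane.shape
    # Map normalized coordinates to continuous grid indices.
    x = u * (w - 1)
    y = v * (h - 1)
    # Index of the upper-left vertex of the enclosing cell, clamped so x0+1 and y0+1 stay in range.
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    tx, ty = x - x0, y - y0
    # Blend the four adjacent vertices: first along x, then along y.
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x0 + 1]
    bottom = (1 - tx) * plane[y0 + 1, x0] + tx * plane[y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom

def multi_plane_features(planes, point):
    """Project a 3D point in [0, 1]^3 onto three axis-aligned planes and gather features.

    Concatenation is an assumption; other multi-plane methods combine planes differently.
    """
    x, y, z = point
    return np.concatenate([
        bilinear_lookup(planes["xy"], x, y),
        bilinear_lookup(planes["xz"], x, z),
        bilinear_lookup(planes["yz"], y, z),
    ])
```

In a trainable model the plane arrays would be learnable parameters. Because each lookup touches only the four vertices around the query, a gradient step updates features locally, which is consistent with the bias toward fine details that the abstract attributes to this representation.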

Paper Structure

This paper contains 48 sections, 5 equations, 16 figures, and 20 tables.

Figures (16)

  • Figure 1: Qualitative results for the standup case in dynamic NeRFs using 25 training poses (about $17\%$ of the original data). This case is challenging because little information is available along the time axis. Figure (a) is produced by HexPlane (cao2023hexplane). Figure (b) is the image rendered by the proposed method.
  • Figure 2: Conceptual illustration of the proposed method, which draws global context from a coordinate network and fine-grained details from multi-plane encoding. The method effectively displays two heterogeneous features. The label 1 indicates the use of the coordinate network alone, while the label 1$+$2 indicates the use of both the coordinate-based MLP network and the multi-plane representation.
  • Figure 3: Schematic of the proposed method. Feature acquisition and the encoder are discussed in \ref{subsec_architecture} and \ref{subsec_curriculum_weighting}. The loss function and regularization are described in \ref{subsec_loss_function}.
  • Figure 4: Qualitative results on image regression trained with a 50% random mask. The first row displays images rendered using only low-frequency or resolution features, while the second row shows images rendered with the full range of features. The numeric value indicates the average magnitude spectrum obtained from the Fourier transform.
  • Figure 5: Rendered images of the lego, drums, and ship cases in the static NeRF dataset by FreeNeRF, TensoRF, K-Planes, and ours. The rendered images are the $\{83, 129, 95\}$-th in the test set, respectively.
  • ...and 11 more figures