
ResFields: Residual Neural Fields for Spatiotemporal Signals

Marko Mihajlovic, Sergey Prokudin, Marc Pollefeys, Siyu Tang

TL;DR

ResFields addresses the capacity bottleneck of MLP-based neural fields on complex spatiotemporal signals by adding time-dependent residual terms to the weights of the MLP's linear layers and factorizing these residuals with a low-rank scheme. This increases expressive power without widening the base MLP, preserves the MLP's implicit regularization, and remains compatible with existing MLP-based neural fields. Across 2D video, temporal SDFs, dynamic NeRF, and scene-flow tasks, ResFields consistently improves reconstruction quality. Because a smaller MLP suffices, training is also faster and uses less GPU memory (approximately 2.5 times faster convergence and 30% lower memory in the video experiment of Figure 4). The method generalizes well, is practical for modeling dynamic scenes from sparse data, and comes with open-source resources to support reproducibility and further development.

Abstract

Neural fields, a category of neural networks trained to represent high-frequency signals, have gained significant attention in recent years due to their impressive performance in modeling complex 3D data, such as signed distance (SDFs) or radiance fields (NeRFs), via a single multi-layer perceptron (MLP). However, despite the power and simplicity of representing signals with an MLP, these methods still face challenges when modeling large and complex temporal signals due to the limited capacity of MLPs. In this paper, we propose an effective approach to address this limitation by incorporating temporal residual layers into neural fields, dubbed ResFields. It is a novel class of networks specifically designed to effectively represent complex temporal signals. We conduct a comprehensive analysis of the properties of ResFields and propose a matrix factorization technique to reduce the number of trainable parameters and enhance generalization capabilities. Importantly, our formulation seamlessly integrates with existing MLP-based neural fields and consistently improves results across various challenging tasks: 2D video approximation, dynamic shape modeling via temporal SDFs, and dynamic NeRF reconstruction. Lastly, we demonstrate the practical utility of ResFields by showcasing its effectiveness in capturing dynamic 3D scenes from sparse RGBD cameras of a lightweight capture system.
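
The abstract describes two ingredients: a time-dependent residual added to a layer's weights, and a matrix factorization that keeps the number of extra parameters small. The NumPy sketch below illustrates one way these could fit together. It is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ResFieldLinear`, the integer frame index `t`, the per-frame coefficient table `v`, the basis matrices `M`, and the zero initialization are all hypothetical choices made for this example.

```python
import numpy as np


class ResFieldLinear:
    """Linear layer with weights W + W_res(t), where W_res(t) is low-rank in time.

    Hypothetical sketch: names and initialization are assumptions.
    """

    def __init__(self, in_dim, out_dim, num_frames, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Time-independent weights and bias, shared by all frames.
        self.W = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(out_dim, in_dim))
        self.b = np.zeros(out_dim)
        # Assumed factorization: per-frame coefficients v[t] mix `rank`
        # basis matrices M[r] that are shared across frames.
        self.v = np.zeros((num_frames, rank))  # zeros: starts as a plain linear layer
        self.M = rng.normal(0.0, 0.01, size=(rank, out_dim, in_dim))

    def residual(self, t):
        # W_res(t) = sum_r v[t, r] * M[r], shape (out_dim, in_dim).
        return np.tensordot(self.v[t], self.M, axes=1)

    def __call__(self, x, t):
        # x: (batch, in_dim); t: integer frame index.
        return x @ (self.W + self.residual(t)).T + self.b

    def num_residual_params(self):
        return self.v.size + self.M.size


layer = ResFieldLinear(in_dim=128, out_dim=128, num_frames=100, rank=10)
x = np.ones((4, 128))
y = layer(x, t=3)  # shape (4, 128)

# Extra parameters: one full residual matrix per frame vs. the factorized form.
full_params = 100 * 128 * 128                  # 1,638,400
low_rank_params = layer.num_residual_params()  # 100*10 + 10*128*128 = 164,840
```

In this toy configuration the factorized residual needs roughly a tenth of the parameters of an independent residual matrix per frame, while the shared weights `W` and the shared basis `M` still couple all frames together.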

Paper Structure (20 sections, 6 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: ResField extends an MLP architecture to effectively represent complex temporal signals by replacing the conventional linear layers with Residual Field Layers. As such, ResField is versatile and straightforwardly compatible with most existing temporal neural fields. Here we demonstrate its applicability on three challenging tasks by extending Siren [sitzmann2020implicit] and TNeRF [DyNeRF]: (a) learning temporal signed distance fields, (b) learning neural radiance fields from four RGB views, and (c) learning neural radiance fields from three time-synchronized RGBD views captured by our lightweight rig. The figure is best viewed in electronic format on a color screen; please zoom in to observe details.
  • Figure 2: ResField MLP Architecture.
  • Figure 3: Factorization of $\boldsymbol{\mathcal{W}}_i$.
  • Figure 4: 2D video approximation. Comparison of different neural fields on fitting RGB videos. The training and test PSNR curves (left and right, respectively) indicate the trade-off between the model's capacity and its generalization properties. Instant NGP overfits the training pixels well but struggles to generalize to unseen pixels. A Siren MLP with 1024 neurons (Siren-1024) generalizes well but lacks representation power (low training and low test PSNR). A smaller Siren with 512 neurons implemented with ResFields (Siren-512+ResFields) generalizes well while offering higher model capacity. Besides the higher accuracy, our approach offers approximately 2.5 times faster convergence and 30% lower GPU memory requirements due to using a smaller MLP (Tab. \ref{wrap-tab:video_approx}). Results on the right provide a visual comparison of Siren with 256 neurons and Siren with 128 neurons implemented with ResField layers.
  • Figure 5: Temporal radiance fields on Owlii (Tab. \ref{tab:rgb_results}); metrics are averaged across all test views.
  • ...and 2 more figures
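
The Figure 4 caption compares training and test PSNR, that is, PSNR on pixels the model was fitted to versus pixels it never saw. The sketch below shows how such a comparison could be set up with the standard PSNR definition. The helper names `psnr` and `split_pixels`, the 10% test fraction, and the assumption that intensities lie in [0, 1] are illustrative choices, not details taken from the paper.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for signals with peak value max_val."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


def split_pixels(num_pixels, test_fraction=0.1, seed=0):
    """Randomly split pixel indices into disjoint train and test sets.

    The test fraction is an assumed value for illustration.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_pixels)
    n_test = int(num_pixels * test_fraction)
    return perm[n_test:], perm[:n_test]


train_idx, test_idx = split_pixels(100)
# A model is fitted on train_idx only; psnr on train_idx measures capacity,
# psnr on test_idx measures generalization to unseen pixels.
```

A high training PSNR paired with a low test PSNR corresponds to the overfitting behaviour the caption attributes to Instant NGP, while low values on both correspond to insufficient capacity.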