
TFS-NeRF: Template-Free NeRF for Semantic 3D Reconstruction of Dynamic Scene

Sandika Biswas, Qianyi Wu, Biplab Banerjee, Hamid Rezatofighi

TL;DR

TFS-NeRF is a template-free semantic NeRF that reconstructs dynamic scenes containing arbitrary rigid, non-rigid, and deformable entities from sparse RGB data, learning per-entity skinning without template models. The method uses an invertible-neural-network-based forward LBS, conditioned on per-frame skeletons, to map view-space points to canonical space. This is paired with separate SDFs for the deformable and non-deformable parts and a shared RGB renderer; semantic-aware ray sampling yields two point sets, $x^d_v$ and $x^{nd}_v$, so that the geometry of each entity is predicted independently. Key contributions include time-efficient, template-free 3D semantic reconstruction of two interacting entities and extensive validation on BEHAVE, HO3D-V3, and ZJU-MoCap, showing improved reconstruction quality and faster convergence. The approach yields accurate, semantically separable reconstructions applicable to AR/VR and robotics, with potential extension to more complex multi-object scenes through occlusion-aware strategies.

Abstract

Despite advancements in neural implicit models for 3D surface reconstruction, handling dynamic environments with interactions between arbitrary rigid, non-rigid, or deformable entities remains challenging. Generic reconstruction methods adaptable to such dynamic scenes often require additional inputs such as depth or optical flow, or rely on pre-trained image features, for reasonable outcomes. These methods typically use latent codes to capture frame-by-frame deformations. Another set of dynamic scene reconstruction methods is entity-specific, mostly focusing on humans, and relies on template models. In contrast, some template-free methods bypass these requirements and adopt traditional LBS (Linear Blend Skinning) weights for a detailed representation of deformable object motions, although they involve complex optimizations that lead to lengthy training times. To address this, this paper introduces TFS-NeRF, a template-free 3D semantic NeRF for dynamic scenes captured from sparse or single-view RGB videos that feature interactions between two entities; it is more time-efficient than other LBS-based approaches. Our framework uses an Invertible Neural Network (INN) for LBS prediction, simplifying the training process. By disentangling the motions of interacting entities and optimizing per-entity skinning weights, our method efficiently generates accurate, semantically separable geometries. Extensive experiments demonstrate that our approach produces high-quality reconstructions of both deformable and non-deformable objects in complex interactions, with improved training efficiency compared to existing methods.
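
To make the LBS terminology concrete, the following is a minimal sketch of classical forward Linear Blend Skinning, in which a canonical-space point is moved to view space as a weighted sum of per-bone rigid transforms. It is not the paper's INN-based formulation and does not show the inverse (view-to-canonical) warping; the helper names `rigid_z`, `apply_rigid`, and `forward_lbs`, the two-bone setup, and the weights are all assumptions chosen for illustration.

```python
import math

def rigid_z(angle, translation):
    """Hypothetical helper: 3x4 rigid transform [R | t], rotating about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz]]

def apply_rigid(T, x):
    """Apply a 3x4 rigid transform to a 3D point."""
    return [row[0] * x[0] + row[1] * x[1] + row[2] * x[2] + row[3] for row in T]

def forward_lbs(x_canonical, weights, bone_transforms):
    """Forward LBS: blend the per-bone transformed copies of one canonical point."""
    assert abs(sum(weights) - 1.0) < 1e-9, "skinning weights must sum to 1"
    out = [0.0, 0.0, 0.0]
    for w, T in zip(weights, bone_transforms):
        p = apply_rigid(T, x_canonical)
        out = [o + w * pi for o, pi in zip(out, p)]
    return out

# Illustrative skeleton: bone 0 stays fixed, bone 1 rotates 90 degrees and shifts along x.
bones = [rigid_z(0.0, (0.0, 0.0, 0.0)), rigid_z(math.pi / 2, (1.0, 0.0, 0.0))]
x_c = [1.0, 0.0, 0.0]                       # point in canonical space
x_v = forward_lbs(x_c, [0.5, 0.5], bones)   # point in view (deformed) space
```

With equal weights the point lands halfway between the two bone-transformed positions. Recovering the canonical point from a view-space point requires inverting this mapping, which is the step the abstract says the INN simplifies.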

Paper Structure (5 sections, 7 equations, 7 figures, 8 tables)

This paper contains 5 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Existing dynamic-NeRF models struggle to generate plausible 3D reconstructions for generic dynamic scenes featuring humans and objects engaged in complex interactions. In this work, we introduce a Neural Radiance Field model designed for 3D reconstruction of such generic scenes, captured using a sparse/single-view video, capable of producing plausible geometry for each semantic element within the scene. In this figure, A: Input RGB, B: predicted normal map, C: predicted semantic reconstruction, and D: predicted skinning weight.
  • Figure 2: Overview of the system. A: To produce a semantically separable reconstruction of each element, we first perform semantic-aware ray sampling. Given a 2D semantic segmentation mask, we shoot two sets of rays and sample two sets of 3D points, $\{x^d_v\}^N_{i=1}$ and $\{x^{nd}_v\}^N_{i=1}$, to differentiate the deformable and non-deformable entities of the scene under interaction. B: Next, each set of points is transformed from the deformed/view space (input frame) to its respective canonical space by inverse warping enabled by the learned forward LBS (details are presented in Fig. 3). C: The individual geometry is then predicted in the canonical space in the form of canonical SDFs by two independent SDF prediction networks $\mathcal{F}^j_{c \to \Omega}(\theta)$ for the deformable and non-deformable entities, denoted $j \in \{d, nd\}$. D: Finally, the output SDFs are used to predict a composite scene rendering. Both branches are optimized jointly using the RGB reconstruction loss.
  • Figure 3: Overview of the transformation from view space to canonical space.
  • Figure 4: Qualitative comparison on the ZJU-MoCap dataset [peng2021neural].
  • Figure 5: Qualitative comparison with SoTA methods on BEHAVE dataset.
  • ...and 2 more figures
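
Step A of the Figure 2 caption (semantic-aware ray sampling) can be illustrated with a small sketch that splits the pixels of a 2D segmentation mask into a deformable and a non-deformable set and draws a fixed number of pixels from each. It covers only the pixel selection; casting rays and sampling 3D points along them is omitted. The label encoding, the function names, and the toy mask are assumptions and are not taken from the paper.

```python
import random

# Hypothetical label encoding; the source does not specify one.
BACKGROUND, DEFORMABLE, NON_DEFORMABLE = 0, 1, 2

def semantic_pixel_sets(mask):
    """Split pixel coordinates (row, col) into deformable and non-deformable sets."""
    sets = {DEFORMABLE: [], NON_DEFORMABLE: []}
    for r, row in enumerate(mask):
        for c, label in enumerate(row):
            if label in sets:
                sets[label].append((r, c))
    return sets[DEFORMABLE], sets[NON_DEFORMABLE]

def sample_pixels(pixels, n, rng):
    """Pick n pixels to shoot rays through; falls back to replacement for small regions."""
    if not pixels:
        return []
    if len(pixels) >= n:
        return rng.sample(pixels, n)
    return [rng.choice(pixels) for _ in range(n)]

# Toy 3x4 segmentation mask: 1 = deformable entity, 2 = non-deformable entity.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]
rng = random.Random(0)                      # seeded for repeatability
deform_px, rigid_px = semantic_pixel_sets(mask)
rays_d = sample_pixels(deform_px, 3, rng)   # pixels feeding the deformable branch
rays_nd = sample_pixels(rigid_px, 3, rng)   # pixels feeding the non-deformable branch
```

Keeping the two pixel sets separate is what allows each branch to receive only points belonging to its own entity, so each canonical SDF is supervised by its own rays.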