TFS-NeRF: Template-Free NeRF for Semantic 3D Reconstruction of Dynamic Scene
Sandika Biswas, Qianyi Wu, Biplab Banerjee, Hamid Rezatofighi
TL;DR
Dynamic scene reconstruction with arbitrary rigid, non-rigid, and deformable entities from sparse RGB data is addressed by TFS-NeRF, a template-free semantic NeRF that learns per-entity skinning without template models. The method uses an invertible neural network-based forward LBS conditioned on frame skeletons to map view-space points to canonical space, paired with separate SDFs for deformable and non-deformable parts and a shared RGB renderer; semantic-aware ray sampling yields two point sets $x^{d}_{v}$ and $x^{nd}_{v}$ for independent geometry. Key contributions include time-efficient template-free 3D semantic reconstruction for two interacting entities and extensive validation on BEHAVE, HO3D-V3, and ZJU-MoCap showing improved quality and faster convergence. This approach enables accurate, semantically separable reconstructions applicable to AR/VR and robotics, with potential extension to more complex multi-object scenes through occlusion-aware strategies.
Abstract
Despite advancements in neural implicit models for 3D surface reconstruction, handling dynamic environments with interactions between arbitrary rigid, non-rigid, or deformable entities remains challenging. Generic reconstruction methods adaptable to such dynamic scenes often require additional inputs such as depth or optical flow, or rely on pre-trained image features for reasonable outcomes. These methods typically use latent codes to capture frame-by-frame deformations. Another set of dynamic scene reconstruction methods is entity-specific, mostly focusing on humans, and relies on template models. In contrast, some template-free methods bypass these requirements and adopt traditional LBS (Linear Blend Skinning) weights for a detailed representation of deformable object motions, although they involve complex optimizations that lead to lengthy training times. As a remedy, this paper introduces TFS-NeRF, a template-free 3D semantic NeRF for dynamic scenes captured from sparse or single-view RGB videos. It handles interactions between two entities and is more time-efficient than other LBS-based approaches. Our framework uses an Invertible Neural Network (INN) for LBS prediction, simplifying the training process. By disentangling the motions of interacting entities and optimizing per-entity skinning weights, our method efficiently generates accurate, semantically separable geometries. Extensive experiments demonstrate that our approach produces high-quality reconstructions of both deformable and non-deformable objects in complex interactions, with improved training efficiency compared to existing methods.
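
For readers unfamiliar with the LBS weights mentioned above, the following is a minimal NumPy sketch of standard linear blend skinning, in which each point is moved by a weighted blend of per-bone rigid transforms. It illustrates only the textbook formulation, not the INN-based prediction used in TFS-NeRF. The function name, argument names, and toy data are assumptions made for illustration.

```python
import numpy as np

def linear_blend_skinning(points, weights, bone_transforms):
    """Deform points with standard linear blend skinning (illustrative only).

    points:          (N, 3) canonical-space points
    weights:         (N, B) per-point skinning weights, each row sums to 1
    bone_transforms: (B, 4, 4) rigid transform of each bone for one frame
    """
    # Homogeneous coordinates so a 4x4 matrix can both rotate and translate.
    ones = np.ones((points.shape[0], 1))
    homo = np.concatenate([points, ones], axis=1)                  # (N, 4)
    # Blend the bone transforms per point: T_n = sum_b w_nb * B_b.
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)   # (N, 4, 4)
    # Apply each blended transform to its own point.
    deformed = np.einsum("nij,nj->ni", blended, homo)              # (N, 4)
    return deformed[:, :3]

# Toy data (hypothetical): bone 0 is the identity, bone 1 translates by +1 in x.
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, 0, 3] = 1.0
pts = np.zeros((3, 3))
w = np.array([[1.0, 0.0],   # follows bone 0 only
              [0.5, 0.5],   # blends both bones equally
              [0.0, 1.0]])  # follows bone 1 only
out = linear_blend_skinning(pts, w, bones)
```

In this toy case the three points end up at x = 0, 0.5, and 1 respectively, showing how the weights interpolate between bone motions. Mapping in the opposite direction, from deformed points back to canonical space, requires inverting this blend, which is the step the abstract says is handled by an invertible network.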
