LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation
Archana Swaminathan, Anubhav Gupta, Kamal Gupta, Shishira R. Maiya, Vatsal Agarwal, Abhinav Shrivastava
TL;DR
LEIA reconstructs and interpolates articulated 3D objects with NeRFs without relying on motion priors or ground-truth 3D supervision. It introduces a latent state dictionary in which each articulation state $t$ has a learnable embedding $z_t$; this embedding conditions a hypernetwork $h_l$ that modulates a base NeRF (Instant-NGP), so a single model can represent multiple motions. A latent manifold loss, together with depth and occlusion regularizers, encourages structured embeddings and stable interpolation, so unseen intermediate articulations can be synthesized from linear combinations in latent space. The approach outperforms prior methods on multi-part objects, scales to multiple joints, and generalizes to real-world data.
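To make the conditioning idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single linear layer standing in for the base NeRF, a linear hypernetwork that adds a weight offset computed from $z_t$, and random (untrained) embeddings. The names `ToyHyperLayer`, `interpolate`, `LATENT_DIM`, and the layer sizes are hypothetical and chosen only for illustration.

```python
import random

LATENT_DIM = 4          # assumed embedding size (hypothetical)
IN_DIM, OUT_DIM = 3, 2  # toy layer shape, not the paper's architecture


def make_matrix(rows, cols, rng, scale=0.1):
    """Random rows x cols matrix as nested lists."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class ToyHyperLayer:
    """One base layer whose weights are offset by a linear hypernetwork h_l(z_t)."""

    def __init__(self, rng):
        self.base = make_matrix(OUT_DIM, IN_DIM, rng)
        # Assumption: the hypernetwork is a single linear map from z to a
        # flattened OUT_DIM * IN_DIM weight offset.
        self.hyper = make_matrix(OUT_DIM * IN_DIM, LATENT_DIM, rng)

    def weights(self, z):
        delta = matvec(self.hyper, z)
        return [
            [self.base[i][j] + delta[i * IN_DIM + j] for j in range(IN_DIM)]
            for i in range(OUT_DIM)
        ]

    def forward(self, x, z):
        return matvec(self.weights(z), x)


def interpolate(z_a, z_b, alpha):
    """Linear combination of two state embeddings, with alpha in [0, 1]."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


rng = random.Random(0)
# Latent state dictionary: one embedding per observed articulation state t.
latents = {t: [rng.gauss(0, 1) for _ in range(LATENT_DIM)] for t in range(2)}
layer = ToyHyperLayer(rng)

x = [0.5, -0.2, 0.1]                               # stand-in for an encoded 3D sample
z_mid = interpolate(latents[0], latents[1], 0.5)   # unseen intermediate state
out_mid = layer.forward(x, z_mid)
```

Because this toy hypernetwork is linear, the output at `z_mid` is exactly the average of the two endpoint outputs. That property belongs to the sketch only; with a nonlinear hypernetwork and NeRF, interpolated embeddings are not guaranteed to behave this way, which is presumably why the latent manifold loss and regularizers are needed.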
Abstract
Neural Radiance Fields (NeRFs) have revolutionized the reconstruction of static scenes and objects in 3D, offering unprecedented quality. However, extending NeRFs to model dynamic objects or object articulations remains a challenging problem. Previous works have tackled this issue by focusing on part-level reconstruction and motion estimation for objects, but they often rely on heuristics regarding the number of moving parts or object categories, which can limit their practical use. In this work, we introduce LEIA, a novel approach for representing dynamic 3D objects. Our method involves observing the object at distinct time steps or "states" and conditioning a hypernetwork on the current state, using this to parameterize our NeRF. This approach allows us to learn a view-invariant latent representation for each state. We further demonstrate that by interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen. Our experimental results highlight the effectiveness of our method in articulating objects in a manner that is independent of the viewing angle and joint configuration. Notably, our approach outperforms previous methods that rely on motion information for articulation registration.
