LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

Archana Swaminathan, Anubhav Gupta, Kamal Gupta, Shishira R. Maiya, Vatsal Agarwal, Abhinav Shrivastava

TL;DR

LEIA addresses the challenge of reconstructing and interpolating articulated 3D objects with NeRFs without relying on motion priors or ground-truth 3D supervision. It introduces a latent state dictionary where each articulation state $t$ has a learnable embedding $z_t$ that conditions a hypernetwork $h_l$ to modulate a base NeRF (Instant-NGP), enabling a single universal model to represent multiple motions. A latent manifold loss, along with depth and occlusion regularizers, fosters structured embeddings and stable interpolation, allowing the synthesis of unseen intermediate articulations by linear combinations in latent space. The approach demonstrates superior performance on multi-part objects, scales to multiple joints, and generalizes to real-world data, highlighting its significance for scalable 3D articulation without heavy priors or supervision.
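The interpolation step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary `latent_dict`, the helper `interpolate_latents`, and the embedding values are hypothetical, and in LEIA the embeddings $z_t$ are learned during training rather than written by hand.

```python
from typing import Dict, List

# Hypothetical latent state dictionary: state id -> embedding z_t.
# The values are made up for illustration; LEIA learns them.
latent_dict: Dict[int, List[float]] = {
    0: [0.0, 1.0, -2.0],
    1: [4.0, 3.0, 2.0],
}


def interpolate_latents(z_a: List[float], z_b: List[float], alpha: float) -> List[float]:
    """Return the linear combination (1 - alpha) * z_a + alpha * z_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(z_a) != len(z_b):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


# Embedding for an unseen state halfway between states 0 and 1.
z_mid = interpolate_latents(latent_dict[0], latent_dict[1], 0.5)
print(z_mid)  # [2.0, 2.0, 0.0]
```

The interpolated embedding would then condition the hypernetwork in place of a learned $z_t$, which is how an intermediate articulation could be rendered without having been observed.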

Abstract

Neural Radiance Fields (NeRFs) have revolutionized the reconstruction of static scenes and objects in 3D, offering unprecedented quality. However, extending NeRFs to model dynamic objects or object articulations remains a challenging problem. Previous works have tackled this issue by focusing on part-level reconstruction and motion estimation for objects, but they often rely on heuristics regarding the number of moving parts or object categories, which can limit their practical use. In this work, we introduce LEIA, a novel approach for representing dynamic 3D objects. Our method involves observing the object at distinct time steps or "states" and conditioning a hypernetwork on the current state, using this to parameterize our NeRF. This approach allows us to learn a view-invariant latent representation for each state. We further demonstrate that by interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen. Our experimental results highlight the effectiveness of our method in articulating objects in a manner that is independent of the viewing angle and joint configuration. Notably, our approach outperforms previous methods that rely on motion information for articulation registration.

Paper Structure (13 sections, 10 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Our method, LEIA, takes multi-view images of an object in four articulation states and learns a view-invariant latent embedding for each state. Given a camera position, interpolating between these latents generates any number of unseen intermediate states of the object.
  • Figure 2: Overview of our method. We take multi-view images of the object in different states as input. A learnable latent dictionary based on an autoencoder learns one embedding per state ID. The latent embedding is fed to the hypernetwork, which generates and modulates the weights of the NeRF to reconstruct the corresponding state. At inference time, a weighted interpolation of the learned latents yields a newly generated intermediate state.
  • Figure 3: Qualitative Results. We show results of PARIS and LEIA for reconstructing the unseen intermediate state, for both single and multiple articulations. PARIS fails most visibly when two parts of the object move differently, because the motion parameters are not registered correctly. LEIA handles this case because it does not depend on part disentanglement to identify and register articulation. LEIA also performs comparably to PARIS for single-part articulation, despite having no dedicated model for motion or part disentanglement.
  • Figure 4: Real-World Results. LEIA faithfully interpolates and reconstructs between two image states from our real-world data, demonstrating its ability to generalize to an in-the-wild setting.
  • Figure 5: t-SNE plot of the jointly learned state embeddings of an object with different moving parts, after dimensionality reduction. The learned representations are well separated and follow a smooth trajectory for each moving part.
  • ...and 2 more figures
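
The conditioning path in the Figure 2 caption (state embedding, then hypernetwork, then NeRF weights) can be illustrated with a toy example. This is a minimal sketch, not LEIA's architecture: `TinyHypernetwork`, `target_layer`, the layer sizes, and the random projection are all hypothetical stand-ins, and a single linear layer replaces the Instant-NGP-based NeRF.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class TinyHypernetwork:
    """Maps a state embedding z to the parameters of one linear layer.

    Hypothetical stand-in: a single random linear projection, where a
    real hypernetwork would be trained.
    """

    def __init__(self, latent_dim: int, in_dim: int, out_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        n_params = in_dim * out_dim + out_dim  # weights plus biases
        self.proj: Matrix = [
            [rng.gauss(0.0, 0.1) for _ in range(latent_dim)]
            for _ in range(n_params)
        ]

    def generate(self, z: List[float]) -> Tuple[Matrix, List[float]]:
        """Return (weight, bias) for the target layer, given embedding z."""
        flat = matvec(self.proj, z)
        n_w = self.in_dim * self.out_dim
        weight = [
            flat[i * self.in_dim:(i + 1) * self.in_dim]
            for i in range(self.out_dim)
        ]
        return weight, flat[n_w:]


def target_layer(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Apply a tanh linear layer whose parameters come from the hypernetwork."""
    return [math.tanh(h + b) for h, b in zip(matvec(weight, x), bias)]


hyper = TinyHypernetwork(latent_dim=3, in_dim=2, out_dim=4)
x = [0.5, -0.25]  # stand-in for an encoded 3D sample point
out_a = target_layer(x, *hyper.generate([1.0, 0.0, 0.0]))  # state A
out_b = target_layer(x, *hyper.generate([0.0, 1.0, 0.0]))  # state B
print(len(out_a), out_a != out_b)
```

The same input `x` produces different outputs under the two embeddings because the target layer's parameters are a function of the state embedding, which is the property that lets one model represent several articulation states.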