PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Jianning Deng, Kartic Subr, Hakan Bilen

Abstract

We present a method for modeling articulated objects from a sparse set of images with unknown camera poses. Existing methods require dense multi-view observations and ground-truth camera poses, whereas our approach operates with as few as four views per articulation state and no camera supervision. Our central insight is that a robust correspondence and alignment problem between unaligned reconstructions must be solved before part motions can be analyzed. We first reconstruct each articulation state independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across articulated poses. A progressive disentanglement strategy then separates static from moving parts, which in turn allows camera motion to be separated robustly from object motion. Finally, we optimize geometry, appearance, and kinematics jointly with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.

Paper Structure

This paper contains 36 sections, 8 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Given a few unposed views of an articulated object across different articulated poses, PAOLI (our method) reconstructs its 3D geometry and appearance, while estimating the part segmentation and the joint axes of its moving parts (shown as arrows) in 3D. The reconstructed object can be rendered from novel viewpoints and in novel articulation configurations.
  • Figure 2: Overview of our method pipeline. Given two sets of K-view images (K=4) of an articulated object in different states, our approach reconstructs high-fidelity 3D geometry and estimates articulation parameters through a three-stage process: (1) initialization of Gaussian splats and camera parameters (top left), (2) Gaussian alignment via deformation fields (top right), and (3) joint optimization of geometry and articulation parameters (bottom).
  • Figure 3: Illustration of the extension to multiple parts. We progressively estimate the segmentation and articulation one part at a time with extra images for new target articulation states.
  • Figure 4: Qualitative analysis of correspondence methods. We visualize the part segmentations computed by our TEASER solver (Yang et al., 2020), which is fed correspondences from each method. Inliers for the two main segmented parts are colored blue and orange; outliers and non-correspondences are shown in gray. Best viewed in color and zoomed in.
  • Figure 5: Qualitative evaluation of novel view synthesis in the target state. Both AGS-GT and AGS-VGGT fail to reconstruct the object in the 4-view setting. In contrast, our method achieves rendering quality similar to AGS-Full, which is trained with 100 images per articulation state and ground-truth camera poses.
  • ...and 10 more figures
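
The Figure 4 caption describes labeling correspondences as inliers of a rigid part motion or as outliers. The toy sketch below illustrates that general idea in 2D only. It is not the TEASER solver and not the authors' pipeline: it swaps in an exhaustive two-point consensus search (RANSAC-style) with a closed-form 2D rigid fit, and the function names `fit_rigid_2d`, `residual`, and `consensus_split`, as well as the tolerance `tol`, are hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst (2D)."""
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(q[0] for q in dst) / n
    cy_d = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - cx_s, py - cy_s  # centered source point
        bx, by = qx - cx_d, qy - cy_d  # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)  # optimal 2D rotation angle
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, tx, ty


def residual(p, q, model):
    """Distance between the transformed source point p and its target q."""
    theta, tx, ty = model
    c, s = math.cos(theta), math.sin(theta)
    return math.hypot(c * p[0] - s * p[1] + tx - q[0],
                      s * p[0] + c * p[1] + ty - q[1])


def consensus_split(src, dst, tol=1e-6):
    """Find the rigid motion supported by the most correspondences.

    Assumption (toy setting): every pair of correspondences is tried as a
    minimal sample, which is only practical for small inputs.
    Returns (model, inlier_indices); model is None if no pair has support.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two paired correspondences")
    best_inliers = []
    for i, j in combinations(range(len(src)), 2):
        model = fit_rigid_2d([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in range(len(src))
                   if residual(src[k], dst[k], model) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if not best_inliers:
        return None, []
    # Refit on the full consensus set for the final estimate.
    model = fit_rigid_2d([src[k] for k in best_inliers],
                         [dst[k] for k in best_inliers])
    return model, best_inliers
```

Calling `consensus_split(src, dst)` on paired point lists returns the dominant rigid motion and the indices that follow it. In a setting like the one the abstract describes, the dominant motion would plausibly correspond to the static part (and hence the relative camera motion), and the remaining indices would be candidates for a moving part. Real correspondences are noisy and 3D, so a practical version would need a looser tolerance and a 3D rigid fit.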