Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos

Leonhard Sommer, Artur Jesslen, Eddy Ilg, Adam Kortylewski

TL;DR

This work tackles unsupervised category-level 3D pose estimation from object-centric videos. It introduces a two-step pipeline. First, videos are aligned to a canonical frame by self-supervised multi-view alignment, using a robust 3D cyclical distance that fuses geometric and DINOv2-based appearance cues. Second, a model learns dense image-to-template vertex correspondences on a prototypical neural mesh, which enables single-image 3D pose estimation via render-and-compare. The key contributions are (i) a 3D cycle-based weighting mechanism for robust cross-view alignment, (ii) a neural-mesh representation with per-vertex features for dense correspondence learning, and (iii) an in-the-wild pose estimator trained without labels or CAD models. The approach outperforms unsupervised baselines at alignment by a large margin and gives faithful, robust 3D pose predictions on PASCAL3D+ and ObjectNet3D, which makes it relevant to robotics and other real-world 3D understanding settings where supervision is unavailable.
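To make the idea of a cycle-based weight concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: the convex mix of geometric and appearance distance controlled by `alpha`, the weight `exp(-tau * cycle)`, and all function names are assumptions chosen for illustration. A vertex is matched from mesh A to mesh B and back again, and it receives a high weight only if the round trip lands close to where it started in 3D.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def combined(pa, fa, pb, fb, alpha):
    # Assumed convex mix: (1 - alpha) * geometric + alpha * appearance.
    return (1.0 - alpha) * l2(pa, pb) + alpha * l2(fa, fb)

def nearest(p, f, pts, feats, alpha):
    """Index of the vertex in (pts, feats) closest to (p, f) under the combined distance."""
    return min(range(len(pts)),
               key=lambda k: combined(p, f, pts[k], feats[k], alpha))

def cyclical_weights(pts_a, feats_a, pts_b, feats_b, alpha=0.2, tau=100.0):
    """Weight each vertex of mesh A by how consistent its A -> B -> A round trip is."""
    weights = []
    for p, f in zip(pts_a, feats_a):
        j = nearest(p, f, pts_b, feats_b, alpha)                       # A -> B
        i_back = nearest(pts_b[j], feats_b[j], pts_a, feats_a, alpha)  # B -> A
        cycle = l2(p, pts_a[i_back])          # 3D distance after the round trip
        weights.append(math.exp(-tau * cycle))  # assumed weighting form
    return weights

# Toy meshes: two well-matched vertices plus one ambiguous vertex in A.
pts_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
feats_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
pts_b = [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]
feats_b = [(1.0, 0.0), (0.0, 1.0)]
weights = cyclical_weights(pts_a, feats_a, pts_b, feats_b)
```

In this toy case the first two vertices return to themselves and keep full weight, while the third vertex returns to a different vertex and is strongly down-weighted.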

Abstract

Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics, e.g., for embodied agents or for training 3D generative models. However, existing methods that estimate the category-level object pose require either large amounts of human annotations, CAD models, or input from RGB-D sensors. In contrast, we tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos without human supervision. We propose a two-step pipeline: First, we introduce a multi-view alignment procedure that determines canonical camera poses across videos with a novel and robust cyclic distance formulation for geometric and appearance matching using reconstructed coarse meshes and DINOv2 features. In a second step, the canonical poses and reconstructed meshes enable us to train a model for 3D pose estimation from a single image. In particular, our model learns to estimate dense correspondences between images and a prototypical 3D template by predicting, for each pixel in a 2D image, a feature vector of the corresponding vertex in the template mesh. We demonstrate that our method outperforms all baselines at the unsupervised alignment of object-centric videos by a large margin and provides faithful and robust predictions in the wild. Our code and data are available at https://github.com/GenIntel/uns-obj-pose3d.
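The dense correspondence step described above can be illustrated with a small sketch. It assumes that per-pixel features and per-vertex template features live in the same space and that a pixel is assigned to the vertex with the highest cosine similarity. The function names and the toy features are hypothetical, and the sketch covers only the lookup, not the trained backbone or the subsequent pose optimization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_pixels_to_vertices(pixel_feats, vertex_feats):
    """For each pixel feature, return the index of the most similar vertex feature."""
    return [
        max(range(len(vertex_feats)), key=lambda k: cosine(p, vertex_feats[k]))
        for p in pixel_feats
    ]

# Toy example: two template vertices with orthogonal features,
# and two pixels whose predicted features lean towards one vertex each.
vertex_feats = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [[0.9, 0.1], [0.2, 0.8]]
matches = match_pixels_to_vertices(pixel_feats, vertex_feats)
```

Each entry of `matches` is a 2D-3D correspondence: pixel `i` in the image maps to template vertex `matches[i]`, whose 3D position is known from the mesh.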

Paper Structure (20 sections, 17 equations, 11 figures, 6 tables)

This paper contains 20 sections, 17 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Illustration of our approach for the unsupervised learning of category-level 3D pose. Our method starts from unaligned object-centric videos of an object category (left) and aligns these into a canonical coordinate frame in a self-supervised manner using a prototypical 3D mesh and self-supervised transformer features (center). Using the aligned videos, we train a neural network backbone to predict 2D-3D correspondences from a single image to enable 3D object pose estimation in the wild (right).
  • Figure 2: Illustration of a neural mesh. It consists of a 3D mesh with one or several neural features per vertex. These features encode patch-level features from a feature extractor. In our unsupervised alignment method, we capture the viewpoint-dependent features from DINOv2. For pose estimation, we learn a category-level neural mesh with viewpoint-invariant features.
  • Figure 3: Qualitative comparison of two unsupervised alignment methods. The first row shows the alignment produced by our proposed method. The second row shows the alignment produced by ZSP (Goodwin et al., 2022). For both methods we use the fifth object instance from the left as the reference. Our proposed method is more accurate than ZSP, which often confuses front and back, especially for cars.
  • Figure 4: Qualitative comparison of our method (top) and ZSP (bottom) at category-level 3D pose prediction in the wild on samples from PASCAL3D+ and ObjectNet3D (we randomly selected the samples to demonstrate the diversity of the results). For both methods, we overlay our coarse mesh reconstruction in the predicted 3D pose.
  • Figure 5: We report the $30^\circ$ accuracy of our alignment method for different choices of our appearance weight $\alpha$ and our cyclical distance temperature $\tau$ resulting in different distances between two meshes with surface features. We see that the maximum accuracy of $75.2\%$ is reached for $\alpha=0.2$ and $\tau=100$.
  • ...and 6 more figures
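
The 30° accuracy reported in Figure 5 can be read as the fraction of samples whose rotation error is below 30°. The sketch below assumes the usual geodesic rotation error, $\arccos((\mathrm{tr}(R_\text{pred}^\top R_\text{gt}) - 1)/2)$, and a strict "less than" threshold. Both are common conventions rather than details confirmed by this summary, and the helper names are hypothetical.

```python
import math

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T R_gt) equals the element-wise product summed over all entries.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_angle))

def accuracy_at(errors_deg, threshold_deg=30.0):
    """Fraction of errors strictly below the threshold (assumed convention)."""
    return sum(1 for e in errors_deg if e < threshold_deg) / len(errors_deg)

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees, used here only to build test inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# A 20 degree error counts as correct at the 30 degree threshold; 40 and 90 do not.
err = rotation_error_deg(rot_z(0.0), rot_z(20.0))
acc = accuracy_at([10.0, err, 40.0, 90.0])
```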