Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation

Prakhar Kaushik, Aayush Mishra, Adam Kortylewski, Alan Yuille

TL;DR

This work tackles source-free unsupervised domain adaptation for category-level 3D pose estimation using only RGB target images. It introduces 3DUDA, which leverages local robust object parts via a differentiable neural mesh and neural feature rendering to update vertex features selectively, guided by per-vertex similarity measures and an EM-like optimization loop. The authors provide a theoretical result showing that selective vertex adaptation can emulate global pseudo-labeling and demonstrate strong empirical gains across real nuisances, occlusion, and extreme UDA scenarios on OOD-CV and Pascal3D+ datasets. The approach achieves robust pose estimation without source data or target-depth/3D annotations, with practical impact in real-world deployment where 3D data is scarce or unavailable.

Abstract

We consider the problem of source-free unsupervised domain adaptation for category-level pose estimation, adapting to a target domain from RGB images only, without any access to source domain data or 3D annotations during adaptation. Collecting and annotating real-world 3D data and corresponding images is a laborious, expensive, yet unavoidable process, since even 3D pose domain adaptation methods require 3D data in the target domain. We introduce 3DUDA, a method capable of adapting to a nuisance-ridden target domain without 3D or depth data. Our key insight stems from the observation that specific object subparts remain stable across out-of-domain (OOD) scenarios, enabling strategic utilization of these invariant subcomponents for effective model updates. We represent object categories as simple cuboid meshes, and harness a generative model of neural feature activations modeled at each mesh vertex, learnt using differentiable rendering. We focus on individual locally robust mesh vertex features and iteratively update them based on their proximity to corresponding features in the target domain, even when the global pose is not correct. Our model is then trained in an EM fashion, alternating between updating the vertex features and the feature extractor. We show that, under mild assumptions, our method simulates fine-tuning on a global pseudo-labeled dataset, which converges to the target domain asymptotically. Through extensive empirical validation, including a complex extreme UDA setup that combines real nuisances, synthetic noise, and occlusion, we demonstrate the potency of our simple approach in addressing the domain shift challenge and significantly improving pose estimation accuracy.

Paper Structure (39 sections, 2 theorems, 9 equations, 10 figures, 50 tables)

Key Result

Theorem 2.4

A target domain $\mathcal{X_T}$ satisfying the assumption labelled ass:pwso elicits another target domain $\mathcal{X_T}^e$ such that each sample in $\mathcal{X_T}^e$ satisfies the global pseudo-labelling constraint ($\mathcal{L_\text{sim}}(f_{i\rightarrow r},C_r) > \delta_r \;\forall\; r \in \{1, 2, \ldots$ [...]
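
As a rough illustration of the constraint in the theorem statement, the sketch below checks whether a single sample exceeds a per-vertex similarity threshold at every vertex, and also reports which vertices pass on their own. It assumes cosine similarity as a stand-in for $\mathcal{L_\text{sim}}$; the function names, toy features, and threshold values are hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def passing_vertices(image_feats, vertex_feats, thresholds):
    """Indices r where sim(f_{i->r}, C_r) > delta_r (assumed similarity: cosine)."""
    return [
        r
        for r, (f, c, delta) in enumerate(zip(image_feats, vertex_feats, thresholds))
        if cosine_similarity(f, c) > delta
    ]

def satisfies_global_constraint(image_feats, vertex_feats, thresholds):
    """True only if every vertex passes its own threshold."""
    return len(passing_vertices(image_feats, vertex_feats, thresholds)) == len(vertex_feats)

# Toy example (hypothetical values): three vertices with 2-D features.
vertex_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feats = [[0.9, 0.1], [1.0, 0.0], [1.0, 1.0]]  # vertex 1 is badly matched
thresholds = [0.8, 0.8, 0.8]

robust = passing_vertices(image_feats, vertex_feats, thresholds)
is_global = satisfies_global_constraint(image_feats, vertex_feats, thresholds)
```

In this toy sample one vertex fails its threshold, so the sample does not meet the all-vertex constraint even though the remaining vertices pass individually. That is the situation in which a selective, per-vertex update still has something to work with.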

Figures (10)

  • Figure 1: Our method utilizes two key observations. (a) Local Pose Ambiguity refers to the inherent pose ambiguity that occurs when only a part of the object is visible. We utilize this ambiguity to update the local vertex features, which roughly correspond to object parts, even when the global pose of the object may be incorrectly estimated. (b) Local Part Robustness refers to the fact that certain parts (e.g. headlights in a car) are less affected in OOD data. This is verified by the (azimuth) polar histogram showing the percentage of robustly detected vertex features per image in the target domain (OOD-CV [oodcv]) using the source model (Before Adaptation). Even before adaptation, a few vertices can be detected robustly, and our method leverages them to adapt to the target domain, as seen in the increased robust vertex ratio After Adaptation.
  • Figure 2: Overview of Our Method (3DUDA)
  • Figure 3: (a) We extract neural features from the source model's CNN backbone, $f_i=\phi_w({\mathcal{X_T}})$, and render feature maps from the source mesh model ($\mathfrak{M_{\mathcal{S}}}$) using vertex features $C_r$; the pose estimate is optimized using render-and-compare. (b) For this incorrectly estimated global pose, we independently measure the similarity of every individual visible vertex feature with the corresponding image feature vector in $f_i$ (Eq. 6) and update individual vertex features using average feature vector values for a batch of images (Eq. 7). (c) The mesh model is then updated using these changed vertices, and the backbone is optimized using the updated neural mesh.
  • Figure 4: Qualitative results of 3DUDA compared to ground truth and NeMo [nemo]. 3DUDA adapts, in an unsupervised manner, to real-world OOD target domains containing nuisances such as weather and occlusion, and produces robust 3D object pose estimates. The CAD objects are for representation only and are taken from ShapeNet [chang2015shapenet].
  • Figure 5: The elicited target distribution $P(X_T^e)$ found by SVA may not be precisely the same as the true target distribution $P(X_T^*)$, but asymptotically (shown by arrows) it tends to the true distribution and the same happens to the adapted source model.
  • ...and 5 more figures
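
The per-vertex steps described in the Figure 3 caption can be sketched roughly as follows. For each vertex, the sketch keeps only those image features in a batch where the vertex is visible and the similarity to the current vertex feature exceeds a threshold, then replaces the vertex feature with their average. Vertices with no qualifying image are left unchanged. Cosine similarity, the single shared threshold, the plain average (no momentum), and all names are assumptions made for illustration; the similarity and update equations referenced in the caption are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0

def update_vertex_features(vertex_feats, batch_feats, visible, threshold):
    """
    vertex_feats: list of R vectors, the current vertex features C_r.
    batch_feats:  batch_feats[i][r] is the image feature found at vertex r's
                  projected location in image i under the estimated pose.
    visible:      visible[i][r] is True if vertex r is visible in image i.
    threshold:    hypothetical similarity threshold shared by all vertices.
    Returns a new list of vertex features; inputs are not modified.
    """
    updated = []
    for r, c_r in enumerate(vertex_feats):
        # Keep only visible, sufficiently similar image features for vertex r.
        selected = [
            feats[r]
            for feats, vis in zip(batch_feats, visible)
            if vis[r] and cosine(feats[r], c_r) > threshold
        ]
        if not selected:
            updated.append(list(c_r))  # nothing reliable: keep the old feature
            continue
        # Average the selected features over the batch, dimension by dimension.
        mean = [sum(f[k] for f in selected) / len(selected) for k in range(len(c_r))]
        updated.append(mean)
    return updated

# Toy batch (hypothetical values): 2 images, 2 vertices, 2-D features.
vertex_feats = [[1.0, 0.0], [0.0, 1.0]]
batch_feats = [
    [[0.8, 0.2], [1.0, 0.0]],  # image 0: vertex 0 close, vertex 1 mismatched
    [[1.0, 0.0], [0.0, 2.0]],  # image 1: vertex 0 matches, vertex 1 not visible
]
visible = [[True, True], [True, False]]

new_feats = update_vertex_features(vertex_feats, batch_feats, visible, threshold=0.9)
```

Here vertex 0 is updated from both images, while vertex 1 keeps its old feature because its only visible observation falls below the threshold. In a full pipeline this update would alternate with a backbone optimization step, which this sketch omits.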

Theorems & Definitions (4)

  • Definition 2.1
  • Definition 2.2
  • Theorem 2.4
  • Theorem A.1