Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation
Prakhar Kaushik, Aayush Mishra, Adam Kortylewski, Alan Yuille
TL;DR
This work tackles source-free unsupervised domain adaptation for category-level 3D pose estimation using only RGB target images. It introduces 3DUDA, which exploits locally robust object parts: a differentiable neural mesh with neural feature rendering lets the method update vertex features selectively, guided by per-vertex similarity measures and an EM-like optimization loop. The authors provide a theoretical result showing that selective vertex adaptation can emulate global pseudo-labeling, and they demonstrate strong empirical gains under real nuisances, occlusion, and extreme UDA scenarios on the OOD-CV and Pascal3D+ datasets. The approach achieves robust pose estimation without source data and without depth or 3D annotations in the target domain, which matters for real-world deployment where 3D data is scarce or unavailable.
Abstract
We consider the problem of source-free unsupervised domain adaptation for category-level pose estimation: adapting to a target domain from RGB images alone, without any access to source-domain data or 3D annotations during adaptation. Collecting and annotating real-world 3D data and corresponding images is a laborious, expensive, yet unavoidable process, since even 3D pose domain adaptation methods require 3D data in the target domain. We introduce 3DUDA, a method capable of adapting to a nuisance-ridden target domain without 3D or depth data. Our key insight is that specific object subparts remain stable across out-of-domain (OOD) scenarios, so these invariant subcomponents can be used to drive effective model updates. We represent object categories as simple cuboid meshes and use a generative model of neural feature activations at each mesh vertex, learned using differentiable rendering. We focus on individual, locally robust mesh vertex features and iteratively update them based on their proximity to the corresponding features in the target domain, even when the global pose estimate is incorrect. The model is then trained in an EM fashion, alternating between updating the vertex features and updating the feature extractor. We show that, under mild assumptions, our method simulates fine-tuning on a global pseudo-labeled dataset, which converges to the target domain asymptotically. Through extensive empirical validation, including a complex extreme UDA setup that combines real nuisances, synthetic noise, and occlusion, we demonstrate that this simple approach addresses the domain shift challenge and significantly improves pose estimation accuracy.
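The selective vertex update described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of cosine similarity as the proximity measure, the fixed threshold, and the momentum-style blend are all assumptions made for illustration. The sketch only shows the idea of updating vertices whose features already agree with the target image while leaving the rest untouched.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def selective_vertex_update(vertex_feats, target_feats, threshold=0.8, momentum=0.9):
    """Hypothetical selective update of per-vertex features.

    vertex_feats: (V, D) features stored at the mesh vertices (source model).
    target_feats: (V, D) target-image features sampled at the projected
                  vertex locations under the current pose estimate.
    threshold, momentum: illustrative hyperparameters (assumed, not from the paper).
    Returns the updated (V, D) features and a boolean mask of updated vertices.
    """
    v = l2_normalize(vertex_feats)
    t = l2_normalize(target_feats)

    # Per-vertex cosine similarity between stored and observed features.
    sim = np.sum(v * t, axis=1)

    # Treat only vertices that already match the target well as "robust".
    robust = sim > threshold

    # Blend robust vertices toward the target features; keep the others as-is.
    updated = v.copy()
    updated[robust] = l2_normalize(momentum * v[robust] + (1.0 - momentum) * t[robust])
    return updated, robust
```

In an EM-style loop, a step like this would alternate with fine-tuning the feature extractor against the updated vertex features; that second step depends on the network and rendering pipeline and is omitted here.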
