ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam
TL;DR
ConceptPose tackles training-free, zero-shot 6D object pose estimation by leveraging language-driven concepts and vision-language model explainability. It generates concept vectors from LLM-derived descriptors and uses GradCAM-based saliency to ground these concepts in 3D, enabling dense 3D-3D correspondences without CAD models or task-specific training. Through RANSAC and ICP, it yields accurate relative poses and demonstrates state-of-the-art performance on standard zero-shot benchmarks, along with competitive few-shot tracking results. The approach democratizes pose estimation by decoupling it from dataset-specific training and CAD models, enabling on-the-fly adaptation to novel objects via language concepts.
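The correspondence and pose steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`match_concepts`, `kabsch`, `ransac_pose`), the mutual-nearest-neighbour matching rule, the cosine similarity, and the inlier threshold are all assumptions made for illustration, and the ICP refinement stage is omitted.

```python
import numpy as np

def match_concepts(feat_a, feat_b):
    """Mutual nearest neighbours between per-point concept vectors.

    Assumption: cosine similarity is used here only for illustration.
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + 1e-12)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + 1e-12)
    sim = a @ b.T                      # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(axis=1)         # best match in B for each point of A
    nn_ba = sim.argmax(axis=0)         # best match in A for each point of B
    idx_a = np.arange(len(a))
    mutual = nn_ba[nn_ab] == idx_a     # keep only matches that agree both ways
    return idx_a[mutual], nn_ab[mutual]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ src @ R.T + t."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # reflection-safe rotation
    t = c_dst - R @ c_src
    return R, t

def ransac_pose(src, dst, iters=200, thresh=0.01, seed=0):
    """RANSAC over 3-point samples, then a refit on the largest inlier set."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        pick = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[pick], dst[pick])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = err < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None or best.sum() < 3:
        raise ValueError("not enough inliers to estimate a pose")
    R, t = kabsch(src[best], dst[best])
    return R, t, best
```

Given two point sets with per-point concept vectors, `match_concepts` proposes 3D-3D correspondences and `ransac_pose` returns a relative rotation, translation, and inlier mask; `thresh` is in the same units as the input points.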
Abstract
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
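For readers unfamiliar with the ADD(-S) score cited above, the following sketch computes the two underlying distances as they are commonly defined: ADD averages the distance between corresponding object points under the ground-truth and predicted poses, while ADD-S uses the closest-point distance and is typically applied to symmetric objects. The function names are hypothetical, and the exact evaluation protocol used for the reported numbers (for example, the accuracy threshold, often 10% of the object diameter) is not stated in this abstract.

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding transformed points."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_s_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    # Pairwise distances (N, N); adequate for small point sets in a sketch.
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```

For a symmetric object such as a cube, a prediction that is off by a 90° rotation about a symmetry axis has a large ADD but an ADD-S of zero, which is why the two variants are usually reported together as ADD(-S).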
