ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam

TL;DR

ConceptPose tackles training-free, zero-shot 6D object pose estimation by leveraging language-driven concepts and vision-language model explainability. It generates concept vectors from LLM-derived descriptors and uses GradCAM-based saliency to ground these concepts in 3D, enabling dense 3D-3D correspondences without CAD models or task-specific training. Through RANSAC and ICP, it yields accurate relative poses and demonstrates state-of-the-art performance on standard zero-shot benchmarks, along with competitive few-shot tracking results. The approach democratizes pose estimation by decoupling from dataset-specific training and CAD models, enabling on-the-fly adaptation to novel objects via language concepts.
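The RANSAC stage mentioned above can be illustrated with a minimal NumPy sketch. It assumes matched 3D point pairs are already available and fits a rigid transform to random three-point samples using the Kabsch algorithm. The function names, iteration count, and inlier threshold are illustrative assumptions, not taken from the paper, and the ICP refinement step is omitted.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def ransac_pose(src, dst, iters=200, thresh=0.01, seed=0):
    """Estimate a rigid pose from 3D-3D correspondences that contain outliers.

    src, dst: (N, 3) arrays where src[i] is matched to dst[i].
    thresh:   inlier distance threshold (same unit as the points; assumed value).
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        # Three correspondences are the minimal sample for a rigid transform.
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("not enough inliers to fit a pose")
    # Refit on all inliers of the best hypothesis.
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

Here `src` would hold anchor-frame points and `dst` their matched query-frame points, so the returned `(R, t)` maps the anchor frame to the query frame under this sketch's convention.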

Abstract

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
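For readers unfamiliar with the ADD(-S) score cited above, the sketch below shows the two underlying distances as they are commonly defined. ADD averages the distance between corresponding object points under the ground-truth and estimated poses. ADD-S, used for symmetric objects, averages the distance from each ground-truth point to its closest estimated point. Benchmarks typically report the fraction of test cases whose distance falls below a threshold such as 10% of the object diameter. The exact evaluation protocol used by the paper is not stated in this summary, so the function names and the threshold are assumptions.

```python
import numpy as np


def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) point array."""
    return points @ R.T + t


def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding transformed points."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return float(np.linalg.norm(gt - est, axis=1).mean())


def adds_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to the nearest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    # Pairwise distances, shape (N, N); fine for small point sets.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


def is_correct(distance, diameter, ratio=0.1):
    """A pose counts as correct if the distance is below ratio * diameter (assumed threshold)."""
    return distance < ratio * diameter
```

For an object with a rotational symmetry, a pose that is off by exactly that symmetry yields a large ADD but an ADD-S of zero, which is why the symmetric variant is used for such objects.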

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: From language concepts to 6D Pose: ConceptPose uses language-driven concepts to create 3D concept maps and match them across views for training-free 6D pose estimation.
  • Figure 2: Overview of the ConceptPose pipeline for zero-shot relative pose estimation. Given an anchor-query RGB-D pair and a category name, we first generate concepts via an LLM. These concepts are used to query a VLM to generate dense saliency maps for both frames. The saliency maps are backprojected into 3D and stacked into concept activation vectors, enabling robust semantic correspondence matching for RANSAC-based relative pose estimation.
  • Figure 3: Qualitative results of ConceptPose's zero-shot relative pose estimation on REAL275, Toyota-Light, and YCB-Video. For each example, the first column displays the cropped RGB image, followed by eight columns showing distinct semantic concepts extracted for the object category along with their corresponding saliency maps. The final column presents the ground-truth anchor pose (top) and the estimated query pose vs. the ground-truth query pose (bottom), obtained by applying the estimated relative transformation to the ground-truth anchor pose. Notably, concept localization succeeds even on semantically simple symmetric objects with few distinctive parts, such as correctly identifying the base of an upside-down cup.
  • Figure 4: Ablation study on the number of concepts on the TYOL dataset.
  • Figure 5: Qualitative visualization of performance changes across different numbers of concepts ($L$) on the REAL275 dataset. The first column shows the anchor pose; columns 2-6 show estimated query poses for different numbers of concepts ($L$) and the ground-truth query pose.
  • ...and 2 more figures
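
The backprojection and stacking steps described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: a pinhole camera with a 3x3 intrinsics matrix `K`, one saliency map per concept already computed by a VLM, and cosine-similarity mutual nearest neighbors as the matching rule. The matching rule and all function names are hypothetical choices for illustration; the summary does not specify how the paper matches concept vectors.

```python
import numpy as np


def backproject(depth, K):
    """Lift a depth map (H, W) to 3D points using pinhole intrinsics K (3x3).

    Returns the (N, 3) points and the boolean (H, W) mask of valid pixels.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # pixel row (v) and column (u) indices
    valid = depth > 0                  # skip pixels with missing depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1), valid


def concept_vectors(saliency, valid):
    """Stack L saliency maps (L, H, W) into unit-norm per-point vectors (N, L)."""
    vecs = saliency[:, valid].T
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-8)


def mutual_nearest_neighbors(a, b):
    """Index pairs (i, j) where a[i] and b[j] are each other's best cosine match."""
    sim = a @ b.T                      # rows of a and b are unit-norm
    a_to_b = sim.argmax(axis=1)
    b_to_a = sim.argmax(axis=0)
    i = np.arange(len(a))
    keep = b_to_a[a_to_b] == i
    return np.stack([i[keep], a_to_b[keep]], axis=1)
```

Running `backproject` and `concept_vectors` on both the anchor and query frames and passing the two vector sets to `mutual_nearest_neighbors` gives index pairs into the two point clouds, which is the kind of 3D-3D correspondence set a RANSAC-based pose solver consumes.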