From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Tessa Pulli, Stefan Thalhammer, Simon Schwaiger, Markus Vincze

TL;DR

The paper addresses 6D pose estimation for novel objects in open-set environments by using vision language models (VLMs) for open-vocabulary localization. It introduces a promptable zero-shot pipeline that reconstructs the scene with a NeRF, uses LERF relevancy maps to obtain a coarse object location, and then applies RGB-D-based point cloud registration (e.g., Teaser++) to compute the 6D pose. The work also analyzes LERF’s suitability for pose estimation, studying instance- and category-level prompts and hyperparameters such as activation thresholds for relevancy maps. Real-world robotic grasping experiments are planned. The approach aims to extend VLM-based perception to practical robotic manipulation of novel objects in unknown settings.

Abstract

Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance and category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
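
The coarse localization step described in the abstract can be illustrated with a minimal sketch. It assumes the relevancy map is already available as one score per 3D point. The function name `coarse_object_location`, the array layout, and the default threshold of 0.5 are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def coarse_object_location(points, relevancy, threshold=0.5):
    """Centroid of all 3D points whose relevancy exceeds the threshold.

    points:    (N, 3) array of 3D positions (assumed layout).
    relevancy: (N,) array of per-point relevancy scores.
    threshold: activation threshold; 0.5 is an illustrative value only.
    Returns (centroid, mask); centroid is None if no point is activated.
    """
    points = np.asarray(points, dtype=float)
    relevancy = np.asarray(relevancy, dtype=float)
    mask = relevancy > threshold
    if not mask.any():
        return None, mask  # the prompt did not activate any point
    return points[mask].mean(axis=0), mask

# Toy scene: two points respond strongly to the prompt, two do not.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [5.0, 5.0, 5.0]])
relevancy = np.array([0.1, 0.8, 0.9, 0.2])

centroid, mask = coarse_object_location(points, relevancy, threshold=0.5)
print(centroid)  # centroid of the two activated points
```

Raising the threshold shrinks the activated region and can leave it empty, which is why the threshold is one of the hyperparameters worth examining.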

Paper Structure (6 sections, 2 figures)


Figures (2)

  • Figure 1: From a set of RGB(-D) images, a NeRF scene is reconstructed. Using LERF, the target object is detected via text prompting. The object centroid is then computed through three-dimensional semantic segmentation. Finally, the pose estimate is determined using a point cloud registration method, e.g., Teaser++ [yang2020teaser].
  • Figure 2: VLM-based scene reconstruction. 3D reconstruction of scene 04 of the HouseCat6D dataset [jung2024housecat6d] with an overlaid relevancy map generated with LERF [kerr2023lerf] for the open-language prompt "teapot". Red shading indicates high relevancy between scene and prompt.
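
The final registration step in Figure 1 can be sketched in simplified form. Teaser++ is a robust method designed to cope with outlier correspondences. The sketch below is not Teaser++: it substitutes a plain least-squares Kabsch alignment, which assumes known, outlier-free one-to-one correspondences between model and scene points. The function name `kabsch` and the toy data are illustrative assumptions.

```python
import numpy as np

def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that target ≈ R @ source + t.

    Assumes source[i] corresponds to target[i] and that there are no outliers.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (source - src_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

# Toy object model and a scene copy rotated 90 degrees about z, then shifted.
model = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
scene = model @ R_true.T + t_true

R_est, t_est = kabsch(model, scene)

# Assemble the 6D pose as a 4x4 homogeneous transform.
pose = np.eye(4)
pose[:3, :3] = R_est
pose[:3, 3] = t_est
print(pose)
```

In a pipeline like the one in Figure 1, the coarse object location would restrict which scene points enter the registration, and a robust solver would replace the closed-form alignment used here.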