Training-Free Robot Pose Estimation using Off-the-Shelf Foundational Models

Laurence Liang

TL;DR

This paper investigates training-free robot pose estimation by using off-the-shelf frontier vision-language models to predict a robot arm's joint angles from a single image. It introduces a one-shot prompting workflow with a reference image to guide estimates and evaluates several frontier models on the DREAM-Mini Panda dataset across simulated and real-world views. The results show that current frontier VLMs can capture approximate pose but fail to provide precise joint angles, with test-time scaling and parameter-size scaling offering little improvement. The work establishes a baseline for training-free pose estimation and highlights practical applications as verification and safety-monitoring tools, while outlining dataset and prompting directions for future improvement.

Abstract

Pose estimation of a robot arm from visual inputs is a challenging task. However, with the increasing adoption of robot arms for both industrial and residential use cases, reliable joint angle estimation can offer improved safety and performance guarantees, and can also serve as a verifier to further train robot policies. This paper proposes using frontier vision-language models (VLMs) as an "off-the-shelf" tool to estimate a robot arm's joint angles from a single target image. By evaluating frontier VLMs on both synthetic and real-world image-data pairs, this paper establishes a performance baseline attained by current VLMs. In addition, this paper presents empirical results suggesting that test-time scaling or parameter scaling alone does not lead to improved joint angle predictions.

Paper Structure

This paper contains 22 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The pose estimation workflow consists of a reference prompt and a target prompt. The reference prompt contains the ground-truth joint angles associated with a reference photo. The target prompt contains only a target photo of the robot arm. The vision-language model estimates the state (joint angles) of the arm in the target photo and provides upper-bound and lower-bound estimates.
  • Figure 2: The three subsets of the DREAM dataset used for pose estimation (lee2020icra:dream). These subsets feature the Franka Emika Panda arm.
  • Figure 3: Labels for each joint of the Franka Emika Panda arm from the DREAM dataset. Image credit: lee2020icra:dream.
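
The one-shot workflow described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the dict layout, the JSON reply format, the use of radians, and the mean-absolute-error metric are all hypothetical choices made here, and the call to an actual VLM client is deliberately left out.

```python
import json
from statistics import mean

NUM_JOINTS = 7  # the Franka Emika Panda arm has seven revolute joints


def build_prompts(reference_image, reference_angles, target_image):
    """Return (reference_prompt, target_prompt) as plain dicts.

    The dict layout and wording are hypothetical; adapt them to the
    VLM client in use. Angles are assumed to be in radians.
    """
    if len(reference_angles) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} reference angles")
    reference_prompt = {
        "image": reference_image,
        "text": "Reference photo. Ground-truth joint angles (radians): "
        + json.dumps(list(reference_angles)),
    }
    target_prompt = {
        "image": target_image,
        "text": "Target photo of the same arm. Estimate each joint angle. "
        "Reply with a JSON list of 7 objects with keys "
        "'estimate', 'lower', and 'upper'.",
    }
    return reference_prompt, target_prompt


def parse_reply(reply_text):
    """Parse the (assumed) JSON reply and check that the bounds are consistent."""
    joints = json.loads(reply_text)
    if len(joints) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(joints)}")
    for joint in joints:
        if not joint["lower"] <= joint["estimate"] <= joint["upper"]:
            raise ValueError("estimate lies outside its own bounds")
    return joints


def mean_absolute_error(joints, ground_truth):
    """Average absolute difference between point estimates and ground truth."""
    return mean(abs(j["estimate"] - g) for j, g in zip(joints, ground_truth))
```

Validating that each point estimate lies between its own lower and upper bound is a cheap sanity check on the model's reply before any accuracy metric is computed.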