PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM

Alan Dao, Dinh Bach Vu, Tuan Le Duc Anh, Bui Quang Huy

TL;DR

PoseLess presents a depth-free framework that directly maps monocular hand images to 25 joint angles without explicit pose estimation, leveraging a vision-language model and a synthetic data pipeline with domain randomization. By training on 100,000 synthetic image–joint pairs generated from a controlled 25-DOF hand model, the method achieves competitive joint-angle prediction while bypassing real-world labeled data. The approach demonstrates cross-morphology generalization from robotic to human hands and suggests practical benefits for prosthetics and human-robot interaction, all while simplifying hardware by removing depth requirements. Limitations arise from the controlled rendering setup, motivating future work to incorporate real-world variability, multi-view, and temporal information to bolster robustness across unconstrained environments.

Abstract

This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using projected representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By projecting visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset.
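The synthetic-data idea in the abstract can be illustrated with a minimal sketch: draw randomized joint configurations for a 25-joint hand and score predictions by mean squared error over the joint angles. This is not the authors' pipeline. Uniform sampling, the joint limits, and the names `sample_joint_configs` and `joint_mse` are assumptions made for illustration, and the rendering step that turns each configuration into an image is omitted.

```python
import numpy as np

NUM_JOINTS = 25  # the text states 25 joint angles


def sample_joint_configs(n, lower, upper, seed=0):
    """Draw n random joint-angle vectors within [lower, upper].

    Uniform sampling is an assumption; the source only says that
    joint configurations are randomized.
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(lower, upper, size=(n, lower.shape[0]))


def joint_mse(pred, target):
    """Mean squared error averaged over all samples and joints."""
    return float(np.mean((pred - target) ** 2))


# Hypothetical joint limits in radians (not taken from the paper).
lower = np.full(NUM_JOINTS, -0.5)
upper = np.full(NUM_JOINTS, 1.5)

# Each row would be paired with one rendered image in a real pipeline.
targets = sample_joint_configs(1000, lower, upper)

# Stand-in "predictions": the targets shifted by a constant 0.1 rad error.
predictions = targets + 0.1
error = joint_mse(predictions, targets)
```

In a full pipeline each sampled row would be sent to a hand simulator to render the paired image, and `joint_mse` would be evaluated on the model's outputs instead of the shifted targets used here.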

Paper Structure

This paper contains 20 sections, 2 equations, 2 figures.

Figures (2)

  • Figure 1: How PoseLess works
  • Figure 2: Line chart depicting the Average MSE for different training checkpoints.