UniHM: Unified Dexterous Hand Manipulation with Vision Language Model

Zhenhao Zhang; Jiaxin Liu; Ye Shi; Jingya Wang

TL;DR

This work introduces UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. It proposes a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving generalization across dexterous hands and scalability to new morphologies.

Abstract

Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, forgoing the rich, compositional guidance of open-vocabulary instructions. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving generalization across dexterous hands and scalability to new morphologies. Our vision-language-action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at https://unihm.github.io/.
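As an illustration of what "a single shared codebook" can mean in practice, the following is a minimal sketch of nearest-neighbor vector quantization across two hands with different joint counts. It is not the authors' tokenizer: the per-hand linear encoders, the hand names `hand_a` and `hand_b`, the joint counts, the latent size, and the codebook size are all hypothetical, and the weights are random rather than learned.

```python
import random

rng = random.Random(0)

LATENT_DIM = 4      # assumed size of the shared latent space
CODEBOOK_SIZE = 8   # assumed number of shared motion tokens

# One codebook shared by every hand morphology (random here, learned in practice).
codebook = [[rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]
            for _ in range(CODEBOOK_SIZE)]

# Hypothetical hands with different numbers of joints.
hand_dofs = {"hand_a": 16, "hand_b": 24}

def make_projection(in_dim, out_dim):
    """Hypothetical per-hand linear encoder mapping joint values to the latent space."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

encoders = {name: make_projection(dof, LATENT_DIM) for name, dof in hand_dofs.items()}

def encode(pose, projection):
    """Project one joint-angle vector into the shared latent space."""
    return [sum(w * x for w, x in zip(row, pose)) for row in projection]

def quantize(latent):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(CODEBOOK_SIZE), key=lambda i: sq_dist(latent, codebook[i]))

def tokenize(hand, trajectory):
    """Map a sequence of joint-angle vectors to token indices in the shared codebook."""
    return [quantize(encode(pose, encoders[hand])) for pose in trajectory]

# Two random 5-frame trajectories, one per hand.
traj_a = [[rng.uniform(-0.5, 0.5) for _ in range(16)] for _ in range(5)]
traj_b = [[rng.uniform(-0.5, 0.5) for _ in range(24)] for _ in range(5)]

tokens_a = tokenize("hand_a", traj_a)
tokens_b = tokenize("hand_b", traj_b)
print(tokens_a, tokens_b)
```

In this sketch both hands emit indices from the same vocabulary, so a downstream sequence model would see one token space regardless of morphology; supporting a new hand would only require a new encoder under these assumptions.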

Paper Structure (29 sections, 55 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Dexterous Grasp Generation
Vision Language Model for Manipulation
Method
Auto Data Annotation
Unified Hand-Dexterous Tokenizer
Dexterous-Hand Manipulation with VLM
Physical-guided Dynamic Refinement
Experiments
Dataset
Evaluation Metric
Main Result
Ablation Study
Conclusion
...and 14 more sections

Figures (8)

Figure 1: Overview. We introduce UniHM, the first unified hand-manipulation framework conditioned on free-form language. UniHM is trained solely on closed-set HOI datasets to follow target trajectories and execute physically feasible interactions, and generalizes to open-world tasks in real-world interactions.
Figure 2: Pipeline. UniHM converts open-vocabulary instructions and RGB-D inputs into executable dexterous-hand trajectories via three stages: (1) morphology-agnostic motion tokenization; (2) language-guided generation that fuses text, perception, and token history to produce manipulation token sequences; and (3) physics-aware decoding with smoothness/contact priors for feasible, stable execution.
Figure 3: Real-World Results. UniHM achieves higher success rates than prior methods on both seen and unseen objects, producing physically consistent and executable real-world manipulations.
Figure B1: The function plots under varying values of $\alpha$ and $k$ are displayed. Note that $\alpha$ and $k$ control the curve behavior for $x > 0$ and $x < 0$, respectively. For comparison, the curve of $y = |x|$ is plotted using a dashed line.
Figure B2: Optimization results when the point cloud includes noise. Here, the black points represent the object point cloud, and the red/green points denote the positions of the dexterous hand's fingertips. (A) Optimization using the $f(d)$ kernel, visualized on the noisy input. (B) Optimization using the Euclidean distance, visualized on the noisy input. (C) The $f(d)$ kernel optimization result projected onto the clean noise-free point cloud. (D) The Euclidean distance optimization result projected onto the noise-free point cloud.
...and 3 more figures
