DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation

Yifan Han; Zhongxi Chen; Yuxuan Zhao; Congsheng Xu; Yanming Shao; Yichuan Peng; Yao Mu; Wenzhao Lian

TL;DR

This paper presents DexHiL, the first integrated arm-hand human-in-the-loop framework for dexterous VLA models. It enables coordinated interventions over the arm and the dexterous hand within a single system, and it includes a lightweight teleoperation interface that supports instantaneous human corrections during execution.

Abstract

While Vision-Language-Action (VLA) models have demonstrated promising generalization capabilities in robotic manipulation, deploying them on specific and complex downstream tasks still demands effective post-training. In parallel, Human-in-the-Loop (HiL) learning has proven to be a powerful mechanism for refining robot policies. However, extending this paradigm to dexterous manipulation remains challenging: multi-finger control is high-dimensional, contact-intensive, and exhibits execution distributions that differ markedly from standard arm motions, leaving existing dexterous VLA systems limited in reliability and adaptability. We present DexHiL, the first integrated arm-hand human-in-the-loop framework for dexterous VLA models, enabling coordinated interventions over the arm and the dexterous hand within a single system. DexHiL introduces an intervention-aware data sampling strategy that prioritizes corrective segments for post-training, alongside a lightweight teleoperation interface that supports instantaneous human corrections during execution. Real-robot experiments demonstrate that DexHiL is an effective post-training framework: it outperforms standard offline-only fine-tuning baselines by an average of 25% in success rate across distinct tasks. Project page: https://chenzhongxi-sjtu.github.io/dexhil/
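The idea of prioritizing corrective segments during sampling can be illustrated with a minimal sketch. This summary does not give the paper's actual weighting formula, so the scheme below is an assumption: each recorded step carries a boolean intervention flag, and flagged steps receive a hypothetical multiplier `w_int` relative to a base weight `w_base`. The function names and default values are illustrative, not taken from the paper.

```python
import random

def sampling_weights(is_intervention, w_int=3.0, w_base=1.0):
    """Return normalized per-step sampling probabilities.

    Assumption: steps recorded during a human correction get weight w_int,
    all other steps get w_base. The paper's real weighting may differ.
    """
    raw = [w_int if flag else w_base for flag in is_intervention]
    if not raw:
        return []
    total = sum(raw)
    return [w / total for w in raw]

def sample_indices(is_intervention, k, w_int=3.0, seed=0):
    """Draw k step indices (with replacement) using the weights above."""
    rng = random.Random(seed)
    probs = sampling_weights(is_intervention, w_int=w_int)
    return rng.choices(range(len(is_intervention)), weights=probs, k=k)

# Four steps, of which only the third was a human correction.
flags = [False, False, True, False]
probs = sampling_weights(flags)
batch = sample_indices(flags, k=5)
```

With these illustrative defaults, the single corrective step is drawn three times as often as any individual autonomous step, which is one simple way to keep rare corrections from being swamped by the larger offline dataset.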

Paper Structure (23 sections, 12 equations, 5 figures, 1 table)

Introduction
Related work
Dexterous Manipulation Data Collection System
Vision Language Action model for dexterous manipulation
Human-in-the-Loop Corrections for Robot Learning
Method
Interactive Human-in-the-Loop Teleoperation System for Dexterous Manipulation
Hand Joint Retargeting
Arm Pose Mapping
Asynchronous Multi-threaded Control and Intervention
Human-in-the-Loop Post-training pipeline
Intervention-aware Weighting Mechanism
Policy Update Mechanism
Warm-up Phase
Online Training Phase
...and 8 more sections

Figures (5)

Figure 1: While scaling offline data for VLA models yields slow accuracy gains and performance plateaus, DexHiL integrates offline training with online Human-in-the-Loop interventions. By strategically reweighting offline and corrective online data, our approach achieves high data efficiency and rapid accuracy growth.
Figure 2: The DexHiL framework, covering the data acquisition system, the human-in-the-loop intervention paradigm, the dexterous manipulation VLA model structure, and the offline-to-online training process. (a) An arm-hand data collection system that supports both offline data collection via teleoperation and online collection of human-in-the-loop policy intervention data, together with a two-stage training method for precise hand joint retargeting. (b) An asynchronous human-in-the-loop policy intervention mechanism for online data collection, exemplified here with the "Plush Toy Grasping" task. (c) The dexterous manipulation VLA policy follows the Being-H0.5 [luo2026being] structure, which uses MoT (Mixture of Transformer) to connect the understanding model with the action expert for multi-modal reasoning and action generation, and inherits its open-source pretrained weights. (d) A two-phase training framework that utilizes both the offline and online datasets. In the warm-up phase, the pretrained weights are fine-tuned into a warm-up model. In each DAgger loop, the system above acquires an online dataset, and reweighted training updates the policy, which is then applied in the next DAgger loop.
Figure 3: Real-world rollouts of dexterous manipulation tasks. (Top) Tissue Extraction: The system achieves precise fingertip alignment and vertical retraction to extract the tissue. (Bottom) Plush Toy Grasping: The controller executes a synchronized multi-joint flexion to securely envelop and lift the deformable object.
Figure 4: Visualization of retargeting results for four representative gestures. We show the input human hand poses and the corresponding configurations generated by Dex-retargeting [qin2023anyteleop], GeoRT [yin2025geometric], and our method. Compared to other methods, our retargeting algorithm generates more accurate, smooth, and coordinated hand poses.
Figure 5: Quantitative performance and training loss analysis. (Left) Success rates across three consecutive training rounds for both Tissue Extraction and Plush Toy Grasping tasks. (Right) Initial training loss at step 10 for both DexHiL and DAgger*. While the loss previously plateaued between 0.002 and 0.008, it significantly increases after incorporating human corrective trajectories.
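The two-phase procedure described in the caption of Figure 2(d) can be sketched as plain control flow. This is a hedged outline, not the authors' implementation: `finetune` and `collect_with_interventions` are hypothetical callables supplied by the caller, the default of three rounds simply mirrors the three training rounds reported in Figure 5, and the choice to continue from the current policy in each round (instead of restarting from the warm-up model) is an assumption.

```python
from typing import Callable, List, Tuple

# (payload, is_intervention): a stand-in for one recorded step. Hypothetical type.
Step = Tuple[str, bool]

def run_post_training(
    pretrained,
    offline_data: List[Step],
    finetune: Callable,
    collect_with_interventions: Callable,
    num_rounds: int = 3,
):
    """Warm-up on offline data, then repeat DAgger-style rounds."""
    # Warm-up phase: fine-tune the pretrained weights on offline data only.
    policy = finetune(pretrained, offline_data)

    online_data: List[Step] = []
    for _ in range(num_rounds):
        # Roll out the current policy while a human may intervene.
        online_data += collect_with_interventions(policy)
        # Update on offline plus aggregated online data; any reweighting
        # of corrective steps is assumed to happen inside `finetune`.
        policy = finetune(policy, offline_data + online_data)
    return policy, online_data

# Toy stand-ins: the "policy" is just a counter of update calls.
def toy_finetune(policy, data):
    return policy + 1

def toy_collect(policy):
    return [(f"rollout-{policy}", True)]

final_policy, online = run_post_training(
    0, [("demo", False)], toy_finetune, toy_collect
)
```

In the toy run, the counter ends at 4 (one warm-up update plus three rounds), and each round's rollout is collected with the policy produced by the previous update, which is the property the DAgger loop relies on.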
