ExtremControl: Low-Latency Humanoid Teleoperation with Direct Extremity Control

Ziyan Xiong, Lixing Fang, Junyun Huang, Kashu Yamazaki, Hao Zhang, Chuang Gan

TL;DR

This paper introduces ExtremControl, a low-latency, whole-body humanoid teleoperation framework that bypasses full-body retargeting by operating on $SE(3)$ poses of a minimal set of extremity links and employing a velocity feedforward term. A Cartesian-space mapping from human poses to humanoid targets, a one-shot calibration, and a parallelizable per-frame mapping enable real-time control, while a whole-body impedance calibration and a velocity feedforward formulation reduce tracking delay. The authors propose a three-stage policy learning pipeline (teacher, student, teleoperation fine-tuning) with carefully designed observations and rewards, combining PPO and DAgger to achieve robust performance. Through extensive simulation and real-world experiments on a Unitree G1 humanoid, ExtremControl achieves end-to-end latency as low as 50 ms and demonstrates dynamic tasks such as ping-pong ball balancing and juggling, showcasing substantial improvements over prior 200 ms baselines and enabling near-perceptual teleoperation. The work also provides a novel optical-flow latency estimation method and discusses limitations related to arm IK ambiguity, lower-body control, and dexterous manipulation, outlining future directions toward a general, low-latency humanoid data-collection platform.

Abstract

Building a low-latency humanoid teleoperation system is essential for collecting diverse reactive and dynamic demonstrations. However, existing approaches rely on heavily pre-processed human-to-humanoid motion retargeting and position-only PD control, resulting in substantial latency that severely limits responsiveness and prevents tasks requiring rapid feedback and fast reactions. To address this problem, we propose ExtremControl, a low-latency whole-body control framework that: (1) operates directly on $SE(3)$ poses of selected rigid links, primarily humanoid extremities, to avoid full-body retargeting; (2) utilizes a Cartesian-space mapping to directly convert human motion to humanoid link targets; and (3) incorporates velocity feedforward control at the low level to support highly responsive behavior under rapidly changing control interfaces. We further provide a unified theoretical formulation of ExtremControl and systematically validate its effectiveness through experiments in both simulation and real-world environments. Building on ExtremControl, we implement a low-latency humanoid teleoperation system that supports both optical motion capture and VR-based motion tracking, achieving end-to-end latency as low as 50 ms and enabling highly responsive behaviors such as ping-pong ball balancing, juggling, and real-time return, thereby substantially surpassing the 200 ms latency limit observed in prior work.

Paper Structure

This paper contains 41 sections, 24 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Tracking objectives for humans and humanoids under VR and MoCap settings.
  • Figure 2: Whole-body impedance calibration of the Unitree G1 elbow joint. Solid lines correspond to proportional gains $k_p$; dashed lines depict the effective impedance.
  • Figure 3: Measured tracking delay in simulation and real world as a function of velocity feedforward ratio. The dashed line represents the theoretical value $\frac{2(1-\eta)}{\omega_n}$.
  • Figure 4: Optical flow for latency analysis.
  • Figure 5: Measured latencies from selected video clips. Normalized optical-flow projections of human and robot are displayed on the right side.
  • ...and 2 more figures
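
The theoretical delay $\frac{2(1-\eta)}{\omega_n}$ quoted in the Figure 3 caption can be checked numerically with a toy model. The sketch below is not the authors' implementation. It assumes a unit-mass joint under critically damped PD control, a constant-velocity (ramp) reference, and a feedforward ratio $\eta$ that scales the desired velocity inside the damping term. The function names, the gains, and the default $\omega_n = 20$ rad/s are all hypothetical choices made for illustration.

```python
def theoretical_delay(eta, omega_n=20.0):
    """Steady-state tracking delay in seconds, as in the Figure 3 caption."""
    return 2.0 * (1.0 - eta) / omega_n


def simulate_tracking_delay(eta, omega_n=20.0, ramp_speed=1.0,
                            dt=1e-4, duration=3.0):
    """Simulate a unit-mass joint following a ramp and return its lag in seconds.

    Assumed control law (illustrative only):
        acc = kp * (q_des - q) + kd * (eta * qd_des - qd)
    with kp = omega_n**2 and kd = 2 * omega_n (critical damping, unit mass).
    """
    kp = omega_n ** 2
    kd = 2.0 * omega_n
    q, qd = 0.0, 0.0
    steps = int(duration / dt)
    for i in range(steps):
        t = i * dt
        q_des = ramp_speed * t      # ramp reference position
        qd_des = ramp_speed         # constant reference velocity
        acc = kp * (q_des - q) + kd * (eta * qd_des - qd)
        # Semi-implicit Euler: update velocity first, then position.
        qd += acc * dt
        q += qd * dt
    # After the loop, q is the state at time steps * dt.
    position_error = ramp_speed * (steps * dt) - q
    # On a ramp, a position lag divided by the speed gives a time delay.
    return position_error / ramp_speed


if __name__ == "__main__":
    for eta in (0.0, 0.25, 0.5, 0.75, 1.0):
        measured = simulate_tracking_delay(eta)
        predicted = theoretical_delay(eta)
        print(f"eta={eta:.2f}  simulated={measured:.5f} s  theory={predicted:.5f} s")
```

Under these assumptions the formula follows from the steady state of the ramp response: the acceleration vanishes, so $k_p e = k_d (1-\eta) v$, and the delay is $e / v = k_d (1-\eta) / k_p = 2(1-\eta)/\omega_n$. With $\eta = 1$ the damping term no longer opposes the commanded motion and the steady-state lag disappears in this idealized model.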