TextOp: Real-time Interactive Text-Driven Humanoid Robot Motion Generation and Control

Weiji Xie, Jiakun Zheng, Jinrui Han, Jiyuan Shi, Weinan Zhang, Chenjia Bai, Xuelong Li

TL;DR

TextOp is a real-time text-driven humanoid motion generation and control framework that supports streaming language commands and on-the-fly instruction modification during execution. It unlocks free-form intent expression and enables smooth transitions across multiple challenging behaviors, such as dancing and jumping, within a single continuous motion execution.

Abstract

Recent advances in humanoid whole-body motion tracking have enabled the execution of diverse and highly coordinated motions on real hardware. However, existing controllers are commonly driven either by predefined motion trajectories, which offer limited flexibility when user intent changes, or by continuous human teleoperation, which requires constant human involvement and limits autonomy. This work addresses the problem of how to drive a universal humanoid controller in a real-time and interactive manner. We present TextOp, a real-time text-driven humanoid motion generation and control framework that supports streaming language commands and on-the-fly instruction modification during execution. TextOp adopts a two-level architecture in which a high-level autoregressive motion diffusion model continuously generates short-horizon kinematic trajectories conditioned on the current text input, while a low-level motion tracking policy executes these trajectories on a physical humanoid robot. By bridging interactive motion generation with robust whole-body control, TextOp unlocks free-form intent expression and enables smooth transitions across multiple challenging behaviors such as dancing and jumping, within a single continuous motion execution. Extensive real-robot experiments and offline evaluations demonstrate instant responsiveness, smooth whole-body motion, and precise control. The project page and the open-source code are available at https://text-op.github.io/
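
The two-level loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `StubGenerator`, `stub_tracking_policy`, `run_loop`, the one-dimensional "frames", and the horizon and gain values are all hypothetical stand-ins chosen so the example runs on its own. It only shows the control flow, in which a high-level generator emits a short chunk of reference frames conditioned on recent history and the current text, a low-level policy tracks each frame, and the text may change between chunks.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Frame = List[float]  # hypothetical: one kinematic reference frame (here a single value)


@dataclass
class StubGenerator:
    """Toy stand-in for the high-level autoregressive motion generator."""
    horizon: int = 4  # frames per generated chunk (assumed value)

    def generate(self, history: Sequence[Frame], text: str) -> List[Frame]:
        # Toy rule: continue from the last generated frame and drift toward
        # a target derived from the text, so chunks join without a jump.
        target = float(len(text))
        last = history[-1][0] if history else 0.0
        chunk = []
        for _ in range(self.horizon):
            last += 0.5 * (target - last)
            chunk.append([last])
        return chunk


def stub_tracking_policy(reference: Frame, robot_state: Frame) -> Frame:
    """Toy stand-in for the low-level tracking policy (proportional action)."""
    return [0.8 * (r - s) for r, s in zip(reference, robot_state)]


def run_loop(prompts: Sequence[str], generator: StubGenerator,
             policy=stub_tracking_policy,
             history_len: int = 2) -> List[Tuple[str, float, float]]:
    """Generate one chunk per prompt entry; the prompt may change between chunks."""
    robot_state: Frame = [0.0]
    history: List[Frame] = []
    log = []
    for text in prompts:
        # High level: short-horizon reference conditioned on history and text.
        chunk = generator.generate(history[-history_len:], text)
        for ref in chunk:
            # Low level: map (reference, robot state) to an action.
            action = policy(ref, robot_state)
            # Toy dynamics: the action is applied directly as a state change.
            robot_state = [s + a for s, a in zip(robot_state, action)]
            log.append((text, ref[0], robot_state[0]))
        history.extend(chunk)
    return log


if __name__ == "__main__":
    for text, ref, state in run_loop(["wave", "wave", "jump high"], StubGenerator()):
        print(f"{text:>10}  ref={ref:.3f}  state={state:.3f}")
```

Because each chunk in this sketch is conditioned on the tail of the previously generated frames, switching the prompt from "wave" to "jump high" changes where the reference is heading without introducing a discontinuity at the chunk boundary.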

Paper Structure (28 sections, 14 equations, 7 figures, 13 tables, 2 algorithms)


Figures (7)

  • Figure 1: TextOp enables a humanoid robot to execute a seamless sequence of diverse skills—ranging from expressive gestures to complex physical tasks—driven by real-time, interactive text commands from the user in a single continuous trial.
  • Figure 2: Overview of TextOp's framework. The framework consists of three main parts: (a) Interactive Motion Generation, including VAE training and LDM training, which together model future reference motion sequences conditioned on the motion history and the text prompt in an autoregressive manner; (b) Dynamic Motion Tracking, where the MLP-based policy $\pi$ takes reference motions and robot states to generate joint actions and is trained in simulation for stable execution; (c) Deployment, where the real-time user text prompt is converted into motions by the generator, translated into actions by the tracking policy based on the robot state, and executed on the physical robot.
  • Figure 3: Illustration of the time-aligned data format including text labels, SMPL motions, and robot motions.
  • Figure 4: Continuous execution of diverse skills on the real robot. The robot seamlessly performs a wide range of tasks, including multiple dance styles, dynamic jumping behaviors, instrument-playing motions, and expressive gestures. For complex long-horizon motions in the private dataset, the entire motion is assigned a unique label wrapped with "$\langle \cdot \rangle$" markers.
  • Figure 5: Real-time recovery under external perturbations. The robot dynamically adjusts its actions based on perturbed states to preserve stability and fulfill text-driven commands.
  • ...and 2 more figures