ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

Haozhe Jia; Jianfei Song; Yuan Zhang; Honglei Jin; Youcheng Fan; Wenshuo Chen; Wei Zhang; Yutao Yue

Abstract

We present ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted, diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes, per frame, joint angles, root planar velocity, root height, and a continuous 6D root orientation. This representation eliminates inference-time retargeting from human body models and remains directly compatible with low-level PD control. The generator is a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces a motion sequence in approximately one second on a cloud GPU. The tracker follows a teacher-student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall-recovery mechanism detects falls from onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves an FID of 0.029 and an R-Precision Top-1 of 0.686 under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands without any fine-tuning on hardware.
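To make the 38-dimensional frame concrete, the sketch below unpacks one frame into its four fields and converts the 6D root orientation into a rotation matrix by Gram-Schmidt orthonormalization, the usual construction for continuous 6D rotations. Only the total dimension and the field names come from the abstract. The joint count (38 - 2 - 1 - 6 = 29), the field order, the convention that the six numbers are the first two columns of the rotation matrix, and every function and constant name here are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed split of the 38 dimensions. The abstract gives only the total
# and the field names; the joint count is inferred as 38 - 2 - 1 - 6 = 29
# and the ordering below is hypothetical.
N_JOINTS = 29   # joint angles (inferred, not stated in the source)
N_VEL = 2       # root planar velocity
N_HEIGHT = 1    # root height
N_ROT6D = 6     # continuous 6D root orientation
FRAME_DIM = N_JOINTS + N_VEL + N_HEIGHT + N_ROT6D

Vec3 = Tuple[float, float, float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> Vec3:
    n = math.sqrt(_dot(v, v))
    if n < 1e-8:
        raise ValueError("cannot normalize a near-zero vector")
    return (v[0] / n, v[1] / n, v[2] / n)


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rot6d_to_matrix(r6: Sequence[float]) -> List[List[float]]:
    """Gram-Schmidt: treat r6 as two 3-vectors and build an orthonormal basis.

    Assumes r6 holds the first two columns of the rotation matrix.
    Returns the matrix as a list of rows.
    """
    a1, a2 = r6[0:3], r6[3:6]
    b1 = _normalize(a1)
    d = _dot(b1, a2)
    # Remove the component of a2 along b1, then normalize.
    b2 = _normalize([a2[i] - d * b1[i] for i in range(3)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the columns; emit row by row.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def unpack_frame(frame: Sequence[float]) -> Dict[str, object]:
    """Split one frame into named fields under the assumed layout."""
    if len(frame) != FRAME_DIM:
        raise ValueError(f"expected {FRAME_DIM} values, got {len(frame)}")
    i = 0
    joints = list(frame[i:i + N_JOINTS]); i += N_JOINTS
    vel = list(frame[i:i + N_VEL]); i += N_VEL
    height = frame[i]; i += N_HEIGHT
    rot6d = frame[i:i + N_ROT6D]
    return {
        "joint_angles": joints,
        "root_planar_velocity": vel,
        "root_height": height,
        "root_rotation": rot6d_to_matrix(rot6d),
    }
```

Because the joint angles are stored directly in the frame, a tracker could in principle pass them to a PD controller as position targets with no inverse kinematics or retargeting step, which is the compatibility the abstract describes.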

Paper Structure (17 sections, 6 equations, 2 figures, 6 tables)


INTRODUCTION
RELATED WORK
Motion Generation
Whole Body Motion Tracking
METHOD
Robot-Skeleton Motion Representation
Cloud: Text-to-Motion Generator
Tracker
Deployment
EXPERIMENTS AND RESULTS
Evaluation of Generation Quality
Real-World Deployment
DISCUSSION AND CONCLUSION
Evaluation Metrics
Reward Functions
...and 2 more sections

Figures (2)

Figure 1: Overview of the proposed framework: The system features a Cloud-Edge decoupled deployment. The Cloud module utilizes a Diffusion Generator to synthesize motion from text instructions via CLIP encoding. The Edge module employs an RL-trained Student Policy that tracks targets using estimated privileged information. The resulting actions are executed via a PD controller for stable humanoid motion in the real world.
Figure 2: Sim-to-Real Results: Validation of robust tracking performance from simple gestures to dynamic maneuvers.
