SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control

Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu

TL;DR

SENTINEL tackles the challenge of teaching humanoid robots to follow natural language commands through end-to-end language–action mapping, removing intermediate motion representations. It builds a large, language-grounded dataset by using a pretrained whole-body controller to track human motions in simulation and annotating them with text, then trains a transformer-based model with a flow-matching action head to predict action chunks from language and proprioceptive history, complemented by a lightweight residual head for sim-to-real refinement. The approach supports multimodal extensions by translating visual or other sensory inputs into language signals and demonstrates strong language grounding, stable execution in both simulation and real-world deployment on a Unitree G1, and effective zero-shot sim-to-real transfer as well as waypoint navigation. These results suggest a scalable path toward language-driven, general-purpose humanoid control with robust performance under dynamics and perception variations. The work establishes SENTINEL as the first fully end-to-end language–action model for humanoid control without intermediate representations, with practical implications for accessible, adaptable robotic systems.

Abstract

Existing humanoid control systems often rely on either teleoperation or modular generation pipelines that separate language understanding from physical execution. The former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by using a pretrained whole-body controller to track human motions in simulation and pairing the tracked motions with their text annotations. The model maps language commands and proprioceptive inputs directly to low-level actions without any intermediate representation. It generates action chunks using flow matching, and a residual action head can subsequently refine them for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and it supports multimodal extensions by converting other inputs into text.
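
The abstract states that action chunks are generated with flow matching. As a minimal sketch of what sampling could look like, assuming a straight-line probability path and a fixed-step Euler solver (neither is specified in this excerpt), the code below integrates a velocity field from Gaussian noise to an action chunk. `toy_velocity` is a hypothetical stand-in for the learned network, which in SENTINEL would be conditioned on the language command and the proprioceptive history; the horizon, action dimension, and step count are illustrative values, not the paper's.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    On a straight-line path x_t = (1 - t) * x0 + t * x1, the velocity that
    carries x_t to x1 is (x1 - x_t) / (1 - t). A trained model would predict
    this from the language command and proprioceptive history instead.
    """
    return [[(tv - xv) / (1.0 - t) for xv, tv in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)]


def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (actions)."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the shape of one chunk: horizon x action_dim.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by 0
        v = velocity_fn(x, t)
        x = [[xv + dt * vv for xv, vv in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


# Illustrative usage: a made-up 4-step chunk of 3-dimensional actions.
target = [[0.1 * i + 0.01 * j for j in range(3)] for i in range(4)]
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target), 4, 3)
```

With this toy field the Euler integration lands on `target` up to floating-point error, which makes the loop easy to check; a learned velocity field would only approximate the data distribution.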

Paper Structure

This paper contains 32 sections, 17 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Overview of SENTINEL. Our framework consists of three stages. (1) We construct a language-action dataset by using a whole-body controller to track human motion data paired with natural language descriptions. (2) We train an end-to-end language–action model with a flow-matching action head, which predicts a robot action chunk conditioned on both the proprioceptive state history and the language command. (3) A post-training stage with a residual action head is introduced to improve the model's performance.
  • Figure 2: Integration of visual perception into SENTINEL for navigation tasks. The onboard D435 camera captures front-view RGB-D images, which are processed by FoundationPose (Wen et al., 2024) to estimate the target position in the robot’s egocentric frame. The estimated waypoint is then inserted into natural-language command templates and provided to SENTINEL, together with the robot’s proprioceptive state, to generate whole-body control actions. This closed-loop process enables the robot to iteratively approach the visual target.
  • Figure 3: Comparison between MDM + Retarget (Tevet et al., 2023) and our method on an example text prompt: “jumps up in a tight twirl.”
  • Figure 3: Ablation study results for model design.
  • Figure 4: Results for different model sizes.
  • ...and 6 more figures
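
The Figure 2 caption describes inserting an estimated egocentric waypoint into natural-language command templates. The sketch below shows one way such a conversion might look. The template wording, the frame convention (x forward, y to the left, in meters), the 5-degree "straight ahead" threshold, and the function name `waypoint_to_command` are all assumptions made for illustration; the paper's actual templates are not given in this excerpt.

```python
import math


def waypoint_to_command(x_forward, y_left):
    """Render an egocentric waypoint as a text command (hypothetical template).

    Assumes x points forward and y points left, both in meters.
    """
    distance = math.hypot(x_forward, y_left)
    # Bearing in degrees: positive means the target is to the robot's left.
    bearing = math.degrees(math.atan2(y_left, x_forward))
    if abs(bearing) < 5.0:
        direction = "straight ahead"
    else:
        side = "left" if bearing > 0 else "right"
        direction = f"{abs(bearing):.0f} degrees to the {side}"
    return f"walk to the target {distance:.1f} meters away, {direction}"


# Illustrative usage with made-up waypoints.
print(waypoint_to_command(2.0, 0.0))
print(waypoint_to_command(1.0, -1.0))
```

In a closed loop, the perception module would re-estimate the waypoint at each step and the regenerated command would be passed to the model together with the current proprioceptive state.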