SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control
Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu
TL;DR
SENTINEL teaches humanoid robots to follow natural language commands through an end-to-end language-action mapping, with no intermediate motion representation. It builds a large, language-grounded dataset by using a pretrained whole-body controller to track human motions in simulation and annotating the resulting trajectories with text. A transformer-based model with a flow-matching action head is then trained to predict action chunks from language and proprioceptive history, and a lightweight residual head refines those actions for sim-to-real transfer. The approach supports multimodal extensions by translating visual or other sensory inputs into language signals. It demonstrates strong language grounding, stable execution both in simulation and on a real Unitree G1, zero-shot sim-to-real transfer, and waypoint navigation. These results suggest a scalable path toward language-driven, general-purpose humanoid control that remains robust under variations in dynamics and perception. The work presents SENTINEL as the first fully end-to-end language-action model for humanoid control without intermediate representations.
Abstract
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation with a pretrained whole-body controller and pairing the tracked motions with their text annotations. The model maps language commands and proprioceptive inputs directly to low-level actions without any intermediate representation. It generates action chunks using flow matching, and these chunks can subsequently be refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and it also supports multi-modal extensions by converting other inputs into text.
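To make the action-generation step concrete, the following is a minimal sketch of how an action chunk might be sampled with flow matching and then adjusted by a residual term. It is an illustration under stated assumptions, not the authors' implementation: the names `sample_action_chunk`, `toy_velocity`, and `refine_with_residual`, the Euler integrator, the step count, and the tensor shapes are all hypothetical, and the closed-form toy velocity field stands in for the learned transformer that would be conditioned on language and proprioceptive history.

```python
import numpy as np


def sample_action_chunk(velocity_fn, cond, chunk_len, action_dim,
                        num_steps=10, rng=None):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (actions).

    Uses fixed-step forward Euler; this integrator is an assumption here.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Start from Gaussian noise with the shape of one action chunk.
    x = rng.standard_normal((chunk_len, action_dim))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t takes values 0, dt, ..., 1 - dt (never exactly 1)
        x = x + dt * velocity_fn(x, t, cond)
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to a known target x_1 is (x_1 - x_t) / (1 - t).
    Here `cond` plays the role of that target chunk.
    """
    return (cond - x) / (1.0 - t)


def refine_with_residual(chunk, residual):
    """Add a small correction to the chunk (hypothetical residual head output)."""
    return chunk + residual


if __name__ == "__main__":
    target = np.ones((8, 4))  # toy "ground-truth" chunk: 8 steps, 4 joints
    chunk = sample_action_chunk(toy_velocity, target, chunk_len=8,
                                action_dim=4, rng=np.random.default_rng(0))
    refined = refine_with_residual(chunk, 0.01 * np.ones_like(chunk))
    print(chunk.shape, refined.shape)
```

In this toy setting the velocity field is known exactly, so the Euler integration lands on the target chunk at the final step. With a learned velocity network the result would only approximate the data distribution, and the number of integration steps becomes a trade-off between latency and accuracy.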
