HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation
Jin Wang, Rui Dai, Weijie Wang, Luca Rossini, Francesco Ruscelli, Nikos Tsagarakis
TL;DR
HYPERmotion tackles the challenge of long-horizon loco-manipulation by humanoids in unstructured environments. It integrates RL-based learning of whole-body motions with an optimization-based planner, a reusable motion library, and LLM/VLM grounding to map natural-language instructions to sequences of primitive actions. Key contributions include a four-sector methodology (motion generation, morphology selection, LLM-driven planning, and user prompts), sim-to-real deployment on a 38-DoF humanoid, and a morphology selector that leverages depth and 2D vision for ground-aware action selection, achieving zero-shot planning for diverse tasks. The framework demonstrates robust adaptation to new tasks and environments, enabling more autonomous and flexible human-robot collaboration, while identifying limits related to library size, retraining needs, and disturbance handling that guide future work.
Abstract
Enabling robots to autonomously perform hybrid motions in diverse environments can be beneficial for long-horizon tasks such as material handling, household chores, and work assistance. This requires extensive exploitation of intrinsic motion capabilities, extraction of affordances from rich environmental information, and planning of physical interaction behaviors. Although recent work has demonstrated impressive humanoid whole-body control abilities, existing approaches struggle to achieve versatility and adaptability for new tasks. In this work, we propose HYPERmotion, a framework that learns, selects, and plans behaviors based on tasks in different scenarios. We combine reinforcement learning with whole-body optimization to generate motion for 38 actuated joints and create a motion library to store the learned skills. We apply the planning and reasoning features of large language models (LLMs) to complex loco-manipulation tasks, constructing a hierarchical task graph that comprises a series of primitive behaviors to bridge lower-level execution with higher-level planning. By leveraging the interaction of distilled spatial geometry and 2D observation with a visual language model (VLM), we ground knowledge into a robotic morphology selector that chooses appropriate actions in single- or dual-arm manipulation and legged or wheeled locomotion. Experiments in simulation and the real world show that learned motions can efficiently adapt to new tasks, demonstrating high autonomy from free-text commands in unstructured scenes. Videos and website: hy-motion.github.io/
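
As an illustration of how a motion library and a task graph of primitive behaviors might fit together, the following is a minimal sketch and not the authors' implementation. All names here (`Primitive`, `MotionLibrary`, `execute_task_graph`, the skill names, and the morphology labels) are hypothetical and chosen only for this example. Each learned skill is reduced to a stub callable, and the ordering of primitives is assumed to be a simple dependency graph.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class Primitive:
    """A stored skill: a name, an assumed morphology label, and a stub executor."""
    name: str
    morphology: str  # hypothetical label, e.g. "wheeled" or "dual-arm"
    run: Callable[[], str]


class MotionLibrary:
    """Hypothetical container mapping primitive names to stored skills."""

    def __init__(self) -> None:
        self._skills: Dict[str, Primitive] = {}

    def register(self, primitive: Primitive) -> None:
        self._skills[primitive.name] = primitive

    def get(self, name: str) -> Primitive:
        return self._skills[name]


def execute_task_graph(library: MotionLibrary,
                       graph: Dict[str, List[str]]) -> List[str]:
    """Run primitives so that each one follows all of its prerequisites.

    `graph` maps a primitive name to the names it depends on.
    """
    order = TopologicalSorter(graph).static_order()
    return [library.get(name).run() for name in order]


# Stub skills standing in for learned motions (assumed names).
library = MotionLibrary()
for skill, morph in [("approach_table", "wheeled"),
                     ("grasp_box", "dual-arm"),
                     ("carry_box", "legged")]:
    library.register(
        Primitive(skill, morph, run=lambda s=skill, m=morph: f"{s}[{m}]")
    )

# A linear plan: grasp after approaching, carry after grasping.
plan = {"grasp_box": ["approach_table"], "carry_box": ["grasp_box"]}
log = execute_task_graph(library, plan)
print(log)
```

In this sketch the plan is written by hand; in the framework described above, the sequence of primitives would come from the LLM-based planner and the morphology choice from the VLM-grounded selector.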
