Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Haoqi Yuan, Yu Bai, Yuhui Fu, Bohan Zhou, Yicheng Feng, Xinrun Xu, Yi Zhan, Börje F. Karlsson, Zongqing Lu

TL;DR

This paper addresses long-horizon embodied tasks on humanoid robots by integrating a Foundation Model (FM) with a modular skill library through a lightweight vision-language Connector. The Connector translates the FM's language plans into executable navigation and manipulation skill commands in real time, improving robustness and efficiency on a full-sized humanoid with active vision. Experiments report strong task completion rates and efficiency gains (e.g., $4.2\times$ faster navigation) across complex scenarios, supporting the value of grounding language-based planning in embodied perception. The design decouples high-level cognition from low-level control while keeping every component except the FM running in real time on onboard hardware, and it leaves room for future extension and safety improvements.

Abstract

Building autonomous robotic agents capable of achieving human-level performance in real-world embodied tasks is an ultimate goal in humanoid robot research. Recent advances have made significant progress in high-level cognition with Foundation Models (FMs) and low-level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency due to compounding errors in long-horizon tasks and the varied latency of different modules. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM's embodied capabilities by translating language-based plans into actionable skill commands and dynamically coordinating locomotion and manipulation to improve task success. With all components, except the FM, deployable on low-cost onboard computation devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0's effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit https://beingbeyond.github.io/Being-0.
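
The three-level hierarchy described above (FM plans, Connector grounds each plan step in a skill command, skill library executes) can be sketched as a minimal control loop. This is an illustrative sketch only, not the authors' implementation: the names `SkillCommand`, `toy_foundation_model`, `toy_connector`, and `run_agent` are hypothetical, the FM is replaced by a lookup table, and the Connector is replaced by keyword matching rather than a VLM operating on camera images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillCommand:
    """Hypothetical skill call: a key into the skill library plus a target."""
    name: str
    argument: str


def toy_foundation_model(instruction: str) -> List[str]:
    """Stand-in for the FM: maps an instruction to language plan steps.

    Assumption: a fixed lookup table replaces a real model query.
    """
    plans = {
        "make a cup of coffee": [
            "go to the coffee machine",
            "grasp the cup",
            "place the cup under the coffee machine",
        ]
    }
    return plans.get(instruction, [])


def toy_connector(plan_step: str) -> SkillCommand:
    """Stand-in for the Connector: turns one plan step into a skill command.

    Assumption: keyword matching replaces the lightweight VLM.
    """
    prefix = "go to "
    if plan_step.startswith(prefix):
        return SkillCommand("navigate", plan_step[len(prefix):])
    return SkillCommand("manipulate", plan_step)


def run_agent(
    instruction: str,
    skills: Dict[str, Callable[[str], bool]],
    log: List[Tuple[str, str, bool]],
) -> bool:
    """Run plan steps in order; stop at the first skill that reports failure."""
    for step in toy_foundation_model(instruction):
        command = toy_connector(step)
        succeeded = skills[command.name](command.argument)
        log.append((command.name, command.argument, succeeded))
        if not succeeded:
            return False
    return True


if __name__ == "__main__":
    # Dummy skills that always succeed; real skills would drive the robot.
    demo_skills = {
        "navigate": lambda target: True,
        "manipulate": lambda action: True,
    }
    history: List[Tuple[str, str, bool]] = []
    print(run_agent("make a cup of coffee", demo_skills, history))
    for entry in history:
        print(entry)
```

In this sketch the FM is queried once per instruction and the Connector handles every individual step, which mirrors the stated goal of keeping slow high-level reasoning off the real-time control path. Replanning after a failed skill is deliberately omitted.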

Paper Structure

This paper contains 29 sections, 13 figures, 18 tables, 3 algorithms.

Figures (13)

  • Figure 1: Overview of the Being-0 framework. The humanoid agent framework, Being-0, comprises three key components: (1) the Foundation Model (FM) for high-level task planning and reasoning, (2) the Connector, a vision-language model (VLM) that bridges the FM and low-level skills, and (3) the Modular Skill Library for robust locomotion and dexterous manipulation. Together, these components enable Being-0 to effectively control a full-sized humanoid robot equipped with multi-fingered hands and active vision, solving complex, long-horizon embodied tasks in real-world environments.
  • Figure 2: Workflow of Being-0 for the task "make a cup of coffee". The figure illustrates the step-by-step execution of the task, with images arranged in two rows. The execution order proceeds left to right in the first row, then continues left to right in the second row. Images with yellow borders indicate decision-making points for the Foundation Model (FM). The yellow dialog boxes display the FM's plans, the green boxes show decisions made by the Connector, and the blue boxes represent the skills called from the modular skill library.
  • Figure 3: A comparison of Being-0 w/o Connector and Being-0 in the long-horizon task "Prepare-coffee." The first row shows recordings of Being-0 without the Connector, while the second row shows recordings of Being-0 with the Connector. Being-0 w/o Connector frequently queries the FM, which often fails to provide correct plans due to its limited embodied scene understanding. In contrast, Being-0 with the Connector completes the task, requiring only a few queries to the FM.
  • Figure 4: Recordings from the ablation study on the active camera. Each row represents a different camera configuration, with the left three images depicting the navigation task and the right three images depicting the manipulation task. Only Being-0 with an active camera achieves robust performance in both navigation and manipulation.
  • Figure 5: A comparison of Being-0 with and without the adjustment method in two-stage tasks involving navigation and manipulation. Each row corresponds to a specific task, with the left three images showing results for Being-0 w/o Adjustment and the right three images showing results for Being-0. Without adjustment, the agent may terminate navigation in improper poses, leading to failed manipulations.
  • ...and 8 more figures