Humanoid Agent via Embodied Chain-of-Action Reasoning with Multimodal Foundation Models for Zero-Shot Loco-Manipulation

Congcong Wen, Geeta Chandra Raju Bethala, Yu Hao, Niraj Pudasaini, Hao Huang, Shuaihang Yuan, Baoru Huang, Anh Nguyen, Mengyu Wang, Anthony Tzes, Yi Fang

TL;DR

A central challenge in humanoid loco-manipulation is grounding natural-language instructions in long-horizon embodied actions. The paper presents Humanoid-COA, a perception–reasoning–action framework that uses Embodied Chain-of-Action (CoA) reasoning to decompose high-level intent into executable loco-manipulation primitives, guided by object affordances, region-based spatial priors, and whole-body feasibility. In real-world experiments on the Unitree H1-2 and G1 robots, the approach generalizes zero-shot and outperforms baselines in manipulation, locomotion, and integrated loco-manipulation, especially under occlusion and in long-horizon tasks. These results support embedding structured CoA reasoning within foundation-model-driven planning for humanoid robots in unstructured environments.

Abstract

Humanoid loco-manipulation, which integrates whole-body locomotion with dexterous manipulation, remains a fundamental challenge in robotics. Beyond whole-body coordination and balance, a central difficulty lies in understanding human instructions and translating them into coherent sequences of embodied actions. Recent advances in foundation models provide transferable multimodal representations and reasoning capabilities, yet existing efforts remain largely restricted to either locomotion or manipulation in isolation, with limited applicability to humanoid settings. In this paper, we propose Humanoid-COA, the first humanoid agent framework that integrates foundation model reasoning with an Embodied Chain-of-Action (CoA) mechanism for zero-shot loco-manipulation. Within the perception–reasoning–action paradigm, our key contribution lies in the reasoning stage, where the proposed CoA mechanism decomposes high-level human instructions into structured sequences of locomotion and manipulation primitives through affordance analysis, spatial inference, and whole-body action reasoning. Extensive experiments on two humanoid robots, Unitree H1-2 and G1, in both an open test area and an apartment environment, demonstrate that our framework substantially outperforms prior baselines across manipulation, locomotion, and loco-manipulation tasks, achieving robust generalization to long-horizon and unstructured scenarios. Project page: https://humanoid-coa.github.io/

Paper Structure

This paper contains 27 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The proposed Humanoid Agent Framework for loco-manipulation, consisting of three stages: (i) Perception and Understanding, where ego-centric observations are converted into scene descriptions and, together with human instructions, tokenized for reasoning; (ii) Reasoning and Planning, where a large language model with Embodied Chain-of-Action Reasoning generates symbolic action plans via affordance, spatial, and whole-body inference; and (iii) Execution and Control, where plans are grounded into primitive commands and translated into low-level motor control for humanoid execution.
  • Figure 2: Example of the proposed Embodied Chain-of-Action Reasoning. Given a natural language instruction, the framework sequentially performs Object Affordance Analysis to extract target properties and feasible actions, Region Spatial Reasoning to handle occlusion and prioritize search areas, and Whole-Body Movement Inference to map symbolic primitives onto the humanoid’s sensorimotor system.
  • Figure 3: Real-world humanoid loco-manipulation tasks performed by two robots, Unitree H1-2 and G1, in two different scenarios: an open area and an apartment environment. Each task is specified by a human instruction (left), and the robot executes the corresponding action sequence to complete it (right), covering manipulation, locomotion, and integrated loco-manipulation.
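
The three reasoning steps named in the Figure 2 caption (object affordance analysis, region spatial reasoning, whole-body movement inference) can be illustrated with a toy sketch. This is not the authors' implementation: the lookup tables stand in for foundation-model outputs, and every name here (`Primitive`, `AFFORDANCES`, `REGION_PRIORS`, `chain_of_action`, the primitive names such as `walk_to`) is a hypothetical chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    """One symbolic step of a plan (hypothetical structure)."""
    kind: str    # "locomotion" or "manipulation"
    name: str    # primitive name, e.g. "walk_to" or "grasp"
    target: str  # region or object the primitive acts on


# Hypothetical tables standing in for what a foundation model would infer.
AFFORDANCES = {"bottle": ["grasp", "lift"], "door": ["push"]}
REGION_PRIORS = {"bottle": ["table", "kitchen counter"], "door": ["hallway"]}


def analyze_affordance(obj):
    """Step 1: list feasible actions for the target object."""
    return AFFORDANCES.get(obj, [])


def rank_regions(obj, visible_regions):
    """Step 2: order candidate search regions, visible ones first.

    Regions that are not currently visible (e.g. occluded) are kept
    but moved to the back; the sort is stable, so prior order is
    preserved within each group.
    """
    priors = REGION_PRIORS.get(obj, [])
    return sorted(priors, key=lambda region: region not in visible_regions)


def infer_whole_body(obj, actions, regions):
    """Step 3: emit a locomotion primitive, then manipulation primitives."""
    plan = []
    if regions:
        plan.append(Primitive("locomotion", "walk_to", regions[0]))
    for action in actions:
        plan.append(Primitive("manipulation", action, obj))
    return plan


def chain_of_action(obj, visible_regions):
    """Run the three steps in sequence and return a symbolic plan."""
    actions = analyze_affordance(obj)
    regions = rank_regions(obj, visible_regions)
    return infer_whole_body(obj, actions, regions)


if __name__ == "__main__":
    for step in chain_of_action("bottle", {"kitchen counter"}):
        print(step.kind, step.name, step.target)
```

In the framework described by the captions, each step would be produced by a large language model reasoning over scene descriptions rather than by fixed tables; the sketch only shows how the outputs of one step feed the next.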