Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport

Hao Zhang; Ding Zhao; H. Eric Tseng

TL;DR

C2C is a three-layer hierarchy that makes the deliberation-to-control pathway explicit, and shows higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.

Abstract

Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave underspecified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi-agent HRC, where long-horizon coordination decisions and physical execution must co-evolve under contact, feasibility, and safety constraints. We address this limitation with cognition-to-control (C2C), a three-layer hierarchy that makes the deliberation-to-control pathway explicit: (i) a VLM-based grounding layer that maintains persistent scene referents and infers embodiment-aware affordances/constraints; (ii) a deliberative skill/coordination layer (the System 2 core) that optimizes long-horizon skill choices and sequences under human-robot coupling via decentralized MARL cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole-body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.
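
The two mechanisms named in the abstract, a residual policy on top of a nominal controller and a shared potential that encodes task progress, can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name and value here is an assumption made for illustration: the functions `nominal_command`, `residual_action`, `shared_potential`, and `shared_reward`, the proportional nominal controller, the negative-distance potential, and the `gain`, `scale`, and `limit` parameters.

```python
import math
from typing import List, Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def nominal_command(object_pos: Sequence[float], anchor: Sequence[float],
                    gain: float = 0.5) -> List[float]:
    """Hypothetical nominal controller: proportional velocity toward the anchor."""
    return [gain * (a - p) for p, a in zip(object_pos, anchor)]


def residual_action(nominal: Sequence[float], residual: Sequence[float],
                    scale: float = 0.2, limit: float = 1.0) -> List[float]:
    """Final command = nominal + scaled residual, clipped to the command limit.

    The residual (the learned policy output in this sketch) is first clamped
    to [-1, 1] so it can only shift the nominal command by at most `scale`.
    """
    return [clip(n + scale * clip(r, -1.0, 1.0), -limit, limit)
            for n, r in zip(nominal, residual)]


def shared_potential(object_pos: Sequence[float], anchor: Sequence[float]) -> float:
    """Assumed shared potential: negative distance from the object to the anchor."""
    return -math.dist(object_pos, anchor)


def shared_reward(prev_pos: Sequence[float], next_pos: Sequence[float],
                  anchor: Sequence[float]) -> float:
    """Reward given to every agent: the increase in the shared potential."""
    return shared_potential(next_pos, anchor) - shared_potential(prev_pos, anchor)


if __name__ == "__main__":
    anchor = [2.0, 0.0]
    base = nominal_command([0.0, 0.0], anchor)       # nominal pull toward anchor
    command = residual_action(base, [0.0, 1.0])      # residual adds a lateral term
    progress = shared_reward([0.0, 0.0], [1.0, 0.0], anchor)
    print(base, command, progress)
```

In this sketch both agents would receive the same `shared_reward`, so an action that raises one agent's return raises the other's by the same amount. That identical-interest setting is a simple special case of a potential game; the formulation in the paper may be more general.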

Paper Structure (24 sections, 11 equations, 6 figures, 5 tables)

Introduction
Related Work
VLM-based Planning and Granularity Bottlenecks
Scripted and Single-Agent Approaches to HRC
Role-Based and Intent-Aware Collaboration Models
Joint Evolution of Multi-Agent Reinforcement Learning
Methodology
Hierarchical Task-Centric MARL Formulation
Cognitive Layer: Decentralized Multi-View Consensus
Spatial grounding via distributed perspectives
Visual prompting and collective intent synthesis
Skill Policy Layer: Tactical Coordination via MARL
Observation space construction
Residual action parameterization
Task-centric reward and heterogeneous learning
...and 9 more sections

Figures (6)

Figure 1: Demonstration of human-robot collaboration via the cognition-to-control hierarchy: (a) the humanoid and human partner collaboratively transport a caster-mounted object while performing real-time heading adjustments; (b) seamless transition between the following and leading roles during the cooperative task; (c) coordination between the humanoid and human to pass through a constrained gate; (d) stable transport of a super-long object through a corridor.
Figure 2: Overview of the proposed cognitive-to-physical hierarchy for HRC, with decision-making partitioned into decoupled layers.
Figure 3: The proposed hierarchical HRC framework for humanoid-object coordination, partitioning decision-making into three cascaded layers: a cognition layer (VLM) that generates semantic-aware object moving directions (anchors) from visual input; a skill policy layer (MARL), where agents maintain independent policies to derive tactical coordination commands; and a cerebellum layer (WBC) for high-frequency whole-body stabilization and joint-level execution.
Figure 4: (a) Episode return during training for the scripted IPPO and the three MARL solvers (HAPPO, HATRPO, PCGrad) over $2.0 \times 10^9$ steps. (b) Mean success rate (SR), by task category (OSP, SCT, SLH), comparing the robot-script baseline with the same MARL methods. (c) Real-world deployment: success rate, completion time $\Gamma$ (s), and mean object tilt rate $\dot{\alpha}$ ($^\circ$/s) for the single-agent baseline versus the MARL candidate.
Figure 5: Visualization of VLM cognitive reasoning and benchmarking results. (a)-(c) illustrate the spatial reasoning in the $S_{33}$ task, where -1 and -2 refer to the views of different agents. The cyan lines represent synthetic LiDAR rays, and the green dot denotes the anchor guiding the skill layer. (d)-(f) present the task and success rate (SR) statistics for all scenarios.
...and 1 more figures
