Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration

Nan Sun, Bo Mao, Yongchang Li, Chenxu Wang, Di Guo, Huaping Liu

TL;DR

The paper argues that monolithic foundation models struggle to deliver reliable autonomy in real-world service robotics. It introduces InteractGen, a five-agent, LLM-powered architecture that distributes perception, planning, decision, validation, and reflection across specialized components while maintaining a shared memory for long-horizon tasks. A three-stage training pipeline of imitation learning, GRPO-based Thought-of-Action (ToA) grounding, and rejection sampling enables robust, dependency-aware action planning and execution, with humans acting as deployable agents when needed. A three-month real-world deployment and extensive simulations show improved task success, adaptability, and user satisfaction, supporting the claim that multi-agent orchestration with human collaboration is a scalable path to socially grounded service autonomy.

Abstract

Foundation models have become central to unifying perception and planning in robotics, yet real-world deployment exposes a mismatch between their monolithic assumption that a single model can handle all cognitive functions and the distributed, dynamic nature of practical service workflows. Vision-language models offer strong semantic understanding but lack embodiment-aware action capabilities while relying on hand-crafted skills. Vision-Language-Action policies enable reactive manipulation but remain brittle across embodiments, weak in geometric grounding, and devoid of proactive collaboration mechanisms. These limitations indicate that scaling a single model alone cannot deliver reliable autonomy for service robots operating in human-populated settings. To address this gap, we present InteractGen, an LLM-powered multi-agent framework that decomposes robot intelligence into specialized agents for continuous perception, dependency-aware planning, decision and verification, failure reflection, and dynamic human delegation, treating foundation models as regulated components within a closed-loop collective. Deployed on a heterogeneous robot team and evaluated in a three-month open-use study, InteractGen improves task success, adaptability, and human-robot collaboration, providing evidence that multi-agent orchestration offers a more feasible path toward socially grounded service autonomy than further scaling standalone models.

Paper Structure

This paper contains 48 sections, 17 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: A holistic demonstration of how InteractGen coordinates robots and humans in real time. The framework monitors online and offline environments, performs collaborative reasoning across specialized agents, plans composite workflows, and triggers physical execution through heterogeneous robots. Humans are treated as deployable agents—InteractGen clarifies, notifies, and delegates subtasks when appropriate—enabling socially grounded service autonomy.
  • Figure 2: Overview of the InteractGen architecture. InteractGen naturally supports three operating modes that emerge to handle long-horizon, human-centered scenarios. Reactive: Perceiver perceives task-relevant information and signals Planner to generate a Thought-of-Action plan; Assigner assigns actions to suitable agents. Active: Manager triggers clarification for ambiguous cases, and a failed check by Validator elicits re-planning, avoiding rigid generate-then-execute patterns. Proactive: Manager reflects, corrects prior reasoning, and reactivates Perceiver for proactive reasoning. These modes enable interactive coordination with humans and robots in dynamic environments. The act–fail–reflect–replan mechanism in Proactive Mode greatly enhances human–robot collaboration.
  • Figure 3: A proactive case of InteractGen reasoning and coordination. The system executes the instruction through a modular pipeline. The PerceptorHub collects raw sensor and chat signals, which are incrementally updated by the Perceiver. The Planner generates dependency-aware Thought-of-Action steps, while the Manager monitors progress and triggers reflection when inconsistencies arise (e.g., Carl not at his usual location). The Assigner distributes subtasks to robots or humans, and the Validator checks feasibility before execution. This act–reflect–replan loop enables InteractGen to adapt to dynamic human availability and environmental changes, ensuring reliable multi-user task execution in real-world settings.
  • Figure 4: Example of active reasoning in InteractGen. The Manager raises a clarification question when input is ambiguous. After validation fails, the Planner re-generates a feasible plan using updated context. This showcases the system’s ability to actively adapt before execution errors occur.
  • Figure 5: Overview of the Memory Unit. The left panel illustrates the long-term memory graph, which incrementally encodes cross-task knowledge, entity relations, and reusable dependencies accumulated throughout past interactions. The right panel shows the short-term memory, which stores task-specific observations, intermediate reasoning states, and execution feedback to support real-time planning and adaptation.
  • ...and 11 more figures
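
The captions for Figures 3 and 4 describe a plan that is dependency-aware, checked by a Validator before execution, and regenerated when the check fails. The sketch below is a minimal illustration of that loop and is not the authors' implementation. The names Step, order_steps, plan_until_valid, validate, and replan are hypothetical, and topological sorting is an assumed way to honor step dependencies.

```python
from dataclasses import dataclass, replace
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    name: str
    agent: str               # hypothetical executor label: "robot" or "human"
    depends_on: tuple = ()   # names of steps that must finish first


def order_steps(steps):
    """Return step names in an order that respects every dependency."""
    graph = {s.name: set(s.depends_on) for s in steps}
    return list(TopologicalSorter(graph).static_order())


def plan_until_valid(steps, validate, replan, max_rounds=3):
    """Check all steps before execution; revise the plan when a check fails."""
    for _ in range(max_rounds):
        failed = [s for s in steps if not validate(s)]
        if not failed:
            by_name = {s.name: s for s in steps}
            return [by_name[n] for n in order_steps(steps)]
        steps = replan(steps, failed[0])  # assumed hook for reflection
    raise RuntimeError("no feasible plan within max_rounds")


# Toy scenario (invented): the robot cannot do the hand-over, so that
# step is delegated to a human while the fetch step stays with the robot.
plan = [Step("deliver", "robot", depends_on=("fetch",)), Step("fetch", "robot")]
validate = lambda s: not (s.name == "deliver" and s.agent == "robot")
replan = lambda steps, bad: [
    replace(s, agent="human") if s == bad else s for s in steps
]
final = plan_until_valid(plan, validate, replan)
print([(s.name, s.agent) for s in final])
# → [('fetch', 'robot'), ('deliver', 'human')]
```

Validating the whole plan before ordering it mirrors the captions' point that re-planning happens before execution errors occur. A real system would presumably replace the two lambdas with model-backed agents.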