ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

Rongfeng Zhao, Xuanhao Zhang, Zhaochen Guo, Xiang Shao, Zhongpan Zhu, Bin He, Jie Chen

Abstract

The integration of large language models (LLMs) with embodied agents has improved high-level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision-language-action (VLA) and vision-language-navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long-horizon sequential and temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, resulting in high costs for experimental validation and policy optimization. To address these limitations, we propose ROSClaw, an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model (VLM) controller. The framework leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping, enabling real-time access to the physical states of both simulated and real-world agents. We further incorporate a data collection and state accumulation mechanism that stores robot states, multimodal observations, and execution trajectories during real-world execution, enabling subsequent iterative policy optimization. During deployment, a unified agent maintains semantic continuity between reasoning and execution and dynamically assigns task-specific control to different agents, thereby improving robustness in multi-policy execution. By establishing an autonomous closed-loop framework, ROSClaw minimizes reliance on robot-specific development workflows. The framework supports hardware-level validation, automated generation of SDK-level control programs, and tool-based execution, enabling rapid cross-platform transfer and continual improvement of robotic skills. Our project page: https://www.rosclaw.io/.
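
To make the data collection and state accumulation mechanism described above concrete, the sketch below logs robot states, multimodal observations, and executed actions during real-world execution for later policy optimization. This is a minimal sketch under assumed conventions: all names here (`ExecutionRecord`, `StateAccumulator`, the JSONL log format, the example agent and task) are hypothetical, as the paper does not publish a storage schema.

```python
# Minimal sketch of a data collection and state accumulation mechanism.
# All names and the JSONL format are hypothetical illustrations.
import json
import time
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

@dataclass
class ExecutionRecord:
    """One timestep of real-world execution by a single agent."""
    agent_id: str                # which heterogeneous agent produced this record
    timestamp: float             # wall-clock time of the observation
    robot_state: Dict[str, Any]  # e.g. joint positions, base pose, gripper state
    observations: Dict[str, Any] # multimodal data: image paths, depth, audio
    action: Dict[str, Any]       # SDK-level command that was issued
    task: str                    # natural-language sub-task being executed

class StateAccumulator:
    """Accumulates execution records for later iterative policy optimization."""
    def __init__(self, log_path: str):
        self.log_path = log_path
        self.records: List[ExecutionRecord] = []

    def record(self, rec: ExecutionRecord) -> None:
        self.records.append(rec)

    def flush(self) -> None:
        # Append-only JSONL so trajectories can be replayed or used for training.
        with open(self.log_path, "a") as f:
            for rec in self.records:
                f.write(json.dumps(asdict(rec)) + "\n")
        self.records.clear()

# Usage inside an agent's control loop (values are illustrative):
acc = StateAccumulator("trajectories.jsonl")
acc.record(ExecutionRecord(
    agent_id="mobile_arm_01",
    timestamp=time.time(),
    robot_state={"joint_pos": [0.0] * 6, "base_pose": [1.2, 0.4, 0.0]},
    observations={"rgb": "frames/000123.png"},
    action={"type": "grasp", "target": "fruit"},
    task="transfer the fruit to the basket",
))
acc.flush()
```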

Paper Structure

This paper contains 10 sections and 5 figures.

Figures (5)

  • Figure 1: The ROSClaw framework adopts a three-layer semantic–physical architecture to bridge high-level cognitive reasoning and low-level physical control. The cognitive layer relies on LLM knowledge graphs and logical elements in the digital space to support macro-level task understanding. The coordination automation layer abstracts hardware heterogeneity through an Online Tool Pool and enables task environment activation. The ROSClaw physical world provides unified control over heterogeneous robotic agents while continuously accumulating multimodal observations, robot states, and reusable skills within a Local Resource Pool. The accumulated interaction experience is fed back to the cognitive layer, forming a closed-loop process that supports continual system evolution and cross-task knowledge reuse.
  • Figure 2: e-URDF-based physical firewall. Heterogeneous system resources (SDKs, MCPs, and APIs) are aggregated into an Online Tool Pool so that abstract instructions can be translated into executable operations. A strict e-URDF-based physical safeguarding mechanism for heterogeneous agents is adopted, leveraging forward dynamics simulation and collision detection in Isaac Lab to ensure physical feasibility prior to scheduling (a minimal feasibility-check sketch appears after this list).
  • Figure 3: Real-World Environment for Collaborative Tasks. In the physical environment, $S_1$ denotes the mobile manipulation region, $S_2$ represents the localized grasping region, and $S_3$ corresponds to the mobile navigation region. Each embodied agent in the physical world is constrained to operate only within a subset of these regions.
  • Figure 4: Heterogeneous Multi-Agent Collaboration. In (A), ROSClaw receives user requirements, initializes the sub-agents, assigns tasks to each agent, and simultaneously exchanges state information with them (a minimal dispatch-loop sketch appears after this list). In (B), the log outputs generated during task execution by each sub-agent are presented. In (C), the physical-world execution is illustrated: the mobile robotic arm approaches the doorway and opens the door; the humanoid robot enters and moves toward the harvesting area; the fixed robotic arm transfers the fruit; and the humanoid robot carries the fruit basket to the kitchen.
  • Figure 5: Validation of e-URDF-based physical safeguarding and the data collection and state accumulation mechanism. In (A), a mobile user interacts with ROSClaw to activate perception and manipulation by the arm agent in the physical world, while simultaneously triggering the data collection and state accumulation mechanism to record agent states and environmental perception data. In (B), ROSClaw generates music based on user instructions, initiates e-URDF-based physical safeguarding to orchestrate a dance routine, and drives the physical system to enable coordinated dancing of seven gimbal units.
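
To make the pre-scheduling feasibility check of Figure 2 concrete, the sketch below runs a forward-dynamics rollout with per-step collision detection before a command is forwarded to the scheduler. The paper performs this check in Isaac Lab; PyBullet is substituted here only as a widely available stand-in, and the URDF paths, joint targets, and step count are all hypothetical.

```python
# Minimal sketch of an e-URDF-style feasibility check prior to scheduling.
# The paper uses Isaac Lab; PyBullet is a stand-in (pip install pybullet),
# and all file names and parameters below are hypothetical.
import pybullet as p

def is_physically_feasible(urdf_path: str, joint_targets: list,
                           obstacle_urdf: str, steps: int = 240) -> bool:
    """Simulate the commanded motion and reject it if any collision occurs."""
    cid = p.connect(p.DIRECT)  # headless physics server
    try:
        p.setGravity(0, 0, -9.81)
        robot = p.loadURDF(urdf_path, useFixedBase=True)
        obstacle = p.loadURDF(obstacle_urdf, basePosition=[0.5, 0.0, 0.0])

        # Drive each joint toward its commanded target with position control.
        for j, q in enumerate(joint_targets):
            p.setJointMotorControl2(robot, j, p.POSITION_CONTROL, targetPosition=q)

        # Forward-dynamics rollout with per-step collision detection.
        for _ in range(steps):
            p.stepSimulation()
            if p.getContactPoints(bodyA=robot, bodyB=obstacle):
                return False  # contact with the environment: infeasible
        return True
    finally:
        p.disconnect(cid)

# Only commands that pass this check would be forwarded to the scheduler, e.g.:
# ok = is_physically_feasible("arm.urdf", [0.3, -0.8, 1.1, 0.0, 0.5, 0.0], "table.urdf")
```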
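
Likewise, the dispatch pattern of Figure 4(A), in which ROSClaw initializes sub-agents, assigns each a sub-task, and exchanges state with them during execution, can be sketched as follows. The `SubAgent` interface, agent names, and plan format are hypothetical illustrations, not the paper's API; the example plan mirrors the door-opening and harvesting scenario of Figure 4(C).

```python
# Minimal sketch of the controller-side dispatch loop in Figure 4(A).
# All names and the plan/message format are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class SubAgent:
    name: str
    execute: Callable[[str], str]  # runs one sub-task, returns a status string
    state: str = "idle"

def dispatch(plan: Dict[str, str], agents: Dict[str, SubAgent]) -> None:
    """Assign each sub-task to its agent and feed status back to the controller."""
    for agent_name, sub_task in plan.items():
        agent = agents[agent_name]
        agent.state = "executing"
        status = agent.execute(sub_task)  # e.g. forwarded to an SDK-level program
        agent.state = status              # state flows back to the controller
        print(f"[ROSClaw] {agent.name}: {sub_task!r} -> {status}")

# Hypothetical plan mirroring the scenario in Figure 4(C):
agents = {
    "mobile_arm": SubAgent("mobile_arm", lambda t: "done"),
    "humanoid":   SubAgent("humanoid",   lambda t: "done"),
    "fixed_arm":  SubAgent("fixed_arm",  lambda t: "done"),
}
dispatch({
    "mobile_arm": "approach the doorway and open the door",
    "humanoid":   "enter and move toward the harvesting area",
    "fixed_arm":  "transfer the fruit",
}, agents)
```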