RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World

Weixin Mao, Weiheng Zhong, Zhou Jiang, Dong Fang, Zhongyue Zhang, Zihan Lan, Haosheng Li, Fan Jia, Tiancai Wang, Haoqiang Fan, Osamu Yoshie

TL;DR

RoboMatrix reframes robot task execution from task-centric to skill-centric planning in open-world settings by extracting reusable meta-skills and organizing them in a hierarchical three-layer architecture (scheduling, skill, hardware). A unified Vision-Language-Action (VLA) framework, augmented with a Hybrid model, enables simultaneous perception, reasoning, and discrete action generation across movement and manipulation, while a data engine and alignment strategy support scalable data collection and continual improvement. The scheduling layer leverages LLMs for task decomposition, the skill layer maps subtasks to meta-skills, and the hardware layer interfaces with the robot via ROS2 for real-time control and feedback. Empirical results show robust generalization to unseen objects and scenes, with about a $50\%$ boost in success rates on Level-V-like generalization tasks and substantial gains on long-horizon tasks, along with extensive ablations confirming the value of model scale, alignment, and data-centric design. The work advances open-world robotics by delivering a scalable, interpretable, and data-efficient framework and provides open-source code, hardware designs, weights, and datasets to accelerate further research.

Abstract

Existing robot policies predominantly adopt the task-centric approach, requiring end-to-end task data collection. This results in limited generalization to new tasks and difficulties in pinpointing errors within long-horizon, multi-stage tasks. To address this, we propose RoboMatrix, a skill-centric hierarchical framework designed for scalable robot task planning and execution in open-world environments. RoboMatrix extracts general meta-skills from diverse complex tasks, enabling the completion of unseen tasks through skill composition. Its architecture consists of a high-level scheduling layer that utilizes large language models (LLMs) for task decomposition, an intermediate skill layer housing meta-skill models, and a low-level hardware layer for robot control. A key innovation of our work is the introduction of the first unified vision-language-action (VLA) model capable of seamlessly integrating both movement and manipulation within one model. This is achieved by combining vision and language prompts to generate discrete actions. Experimental results demonstrate that RoboMatrix achieves a 50% higher success rate than task-centric baselines when applied to unseen objects, scenes, and tasks. To advance open-world robotics research, we will open-source code, hardware designs, model weights, and datasets at https://github.com/WayneMao/RoboMatrix.
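The abstract says the VLA model generates discrete actions but does not spell out the encoding here. One common way to make a continuous command discrete is uniform binning, sketched below. The function names, the 256-bin default, and the value range are illustrative assumptions, not details taken from RoboMatrix.

```python
def discretize(value: float, low: float, high: float, n_bins: int = 256) -> int:
    """Map a continuous value in [low, high] to a bin index in [0, n_bins - 1].

    Hypothetical helper: uniform binning is an assumption, not the paper's
    confirmed scheme.
    """
    clipped = min(max(value, low), high)       # keep out-of-range values in bounds
    frac = (clipped - low) / (high - low)      # normalise to [0, 1]
    return min(int(frac * n_bins), n_bins - 1) # the upper edge falls in the last bin


def undiscretize(index: int, low: float, high: float, n_bins: int = 256) -> float:
    """Return the centre of bin `index`, i.e. the value decoded from a token."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


if __name__ == "__main__":
    token = discretize(0.3, -1.0, 1.0)
    print(token, undiscretize(token, -1.0, 1.0))
```

Decoding returns the bin centre, so the round-trip error is at most half a bin width. That bound is the main design trade-off when choosing the number of bins.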

Paper Structure

This paper contains 61 sections, 1 equation, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Task-Centric vs. Skill-Centric. (a) The task-centric paradigm requires collecting new data and training a new model for each new task. (b) The skill-centric paradigm enables zero-shot task generalization by activating different skill responses within one fully trained VLA skill model.
  • Figure 2: Inspiration of the skill-centric method. Robots with different modalities can perform different tasks, and robots with the same modality can be used in various scenarios. We extract similar elements from the multitude of diverse robotic tasks, defining these elements as meta-skills and storing them in a skill list. Then, these skills are used to train the Vision-Language-Action (VLA) model or to construct hybrid models, which can eventually lead to a skill model capable of adapting to new tasks.
  • Figure 3: The pipeline of data engine.
  • Figure 4: RoboMatrix Overview. The system accepts the task description in either text or audio format. The text can be entered manually, while the audio is converted into text by the audio-to-text module. The Modular Scheduling Layer serves as the high-level planner of the system. The agent decomposes complex tasks into an ordered sequence of subtasks based on the robot's skill list and adds them sequentially to the execution queue. Before executing a subtask, the execution checker verifies its executability by determining, from the robot's environment observations, whether the object to be manipulated or grasped is present in the scene. The Skill Layer maps the description of subtasks to robot actions using either the hybrid model or the VLA model, with the action including a stop signal to determine whether the current subtask is complete. The Hardware Layer manages the controller and stage observer of the robot, with the controller converting actions into control signals and the stage observer continuously updating the robot's state and image in real time.
  • Figure 5: The agent prompt and meta-skills list.
  • ...and 13 more figures
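
The control flow described in the Figure 4 caption (decompose, queue, check, execute until the stop signal) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation. `decompose` is a rule-based stand-in for the LLM agent, `FakeRobot` replaces the ROS2-backed hardware layer, the checker uses set membership instead of vision, and every name below is hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass
class Subtask:
    skill: str   # name of a meta-skill from the skill list
    target: str  # object the skill acts on ("" if none)


# A skill policy returns (action, stop_flag) for a subtask at a given step.
Policy = Callable[[Subtask, int], Tuple[str, bool]]


class FakeRobot:
    """Hypothetical stand-in for the hardware layer (controller + observer)."""

    def __init__(self, visible: Iterable[str]) -> None:
        self.visible: Set[str] = set(visible)
        self.log: List[str] = []

    def observe(self) -> Set[str]:
        return set(self.visible)

    def apply(self, action: str) -> None:
        self.log.append(action)


def decompose(task: str, skill_list: Iterable[str]) -> List[Subtask]:
    """Placeholder for the LLM agent: parses 'skill:target; skill:target'."""
    known = set(skill_list)
    subtasks = []
    for part in task.split(";"):
        skill, _, target = part.partition(":")
        if skill.strip() in known:  # only skills in the list are schedulable
            subtasks.append(Subtask(skill.strip(), target.strip()))
    return subtasks


def run(task: str, robot: FakeRobot, skills: Dict[str, Policy],
        max_steps: int = 50) -> List[str]:
    """Execute subtasks in order; return the names of those completed."""
    queue = deque(decompose(task, skills))
    completed: List[str] = []
    while queue:
        sub = queue.popleft()
        # Execution checker: the target must be present in the observation.
        if sub.target and sub.target not in robot.observe():
            break
        for step in range(max_steps):
            action, stop = skills[sub.skill](sub, step)
            robot.apply(action)
            if stop:  # stop signal marks the subtask as complete
                break
        else:
            break  # step budget exhausted without a stop signal
        completed.append(sub.skill)
    return completed


if __name__ == "__main__":
    three_steps: Policy = lambda sub, step: (f"{sub.skill}-{step}", step >= 2)
    robot = FakeRobot({"red can"})
    skills = {"move to": three_steps, "grasp": three_steps}
    print(run("move to:red can; grasp:red can; grasp:blue cup", robot, skills))
```

Checking executability before each subtask, rather than once up front, lets the loop halt at the first subtask whose target is missing. This is one way to localize failures in a long-horizon task.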