RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World
Weixin Mao, Weiheng Zhong, Zhou Jiang, Dong Fang, Zhongyue Zhang, Zihan Lan, Haosheng Li, Fan Jia, Tiancai Wang, Haoqiang Fan, Osamu Yoshie
TL;DR
RoboMatrix reframes robot task execution in open-world settings from task-centric to skill-centric planning: it extracts reusable meta-skills and organizes them in a hierarchical three-layer architecture (scheduling, skill, hardware). A unified Vision-Language-Action (VLA) framework, augmented with a Hybrid model, simultaneously performs perception, reasoning, and discrete action generation across both movement and manipulation, while a data engine and alignment strategy support scalable data collection and continual improvement. The scheduling layer uses LLMs for task decomposition, the skill layer maps subtasks to meta-skills, and the hardware layer interfaces with the robot via ROS2 for real-time control and feedback. Empirical results show robust generalization to unseen objects and scenes, with a roughly 50% boost in success rates on Level-V-like generalization tasks and substantial gains on long-horizon tasks; extensive ablations confirm the value of model scale, alignment, and data-centric design. The work offers a scalable, interpretable, and data-efficient framework for open-world robotics, with code, hardware designs, weights, and datasets to be open-sourced to accelerate further research.
Abstract
Existing robot policies predominantly adopt a task-centric approach, which requires end-to-end data collection for each task. This limits generalization to new tasks and makes it difficult to pinpoint errors within long-horizon, multi-stage tasks. To address this, we propose RoboMatrix, a skill-centric hierarchical framework designed for scalable robot task planning and execution in open-world environments. RoboMatrix extracts general meta-skills from diverse complex tasks, enabling the completion of unseen tasks through skill composition. Its architecture consists of a high-level scheduling layer that uses large language models (LLMs) for task decomposition, an intermediate skill layer housing meta-skill models, and a low-level hardware layer for robot control. A key innovation of our work is the introduction of the first unified vision-language-action (VLA) model capable of seamlessly integrating both movement and manipulation within one model, which it achieves by combining vision and language prompts to generate discrete actions. Experimental results demonstrate that RoboMatrix achieves a 50% higher success rate than task-centric baselines when applied to unseen objects, scenes, and tasks. To advance open-world robotics research, we will open-source code, hardware designs, model weights, and datasets at https://github.com/WayneMao/RoboMatrix.
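The three-layer flow described in the abstract can be illustrated with a minimal sketch. It is not the RoboMatrix implementation. The scheduling layer is a fixed lookup table standing in for an LLM, the skill layer is a small registry of callables standing in for meta-skill models, and the hardware layer is a mock that only records commands. All names (`Subtask`, `MockHardware`, `decompose`, `SKILLS`, `execute`, and the skill names `move_to`, `grasp`, `release`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    skill: str   # name of the meta-skill to invoke (hypothetical names)
    target: str  # object or location the skill acts on


@dataclass
class MockHardware:
    """Stand-in for the hardware layer; it only records commands."""
    log: List[str] = field(default_factory=list)

    def send(self, command: str) -> None:
        self.log.append(command)


def decompose(task: str) -> List[Subtask]:
    """Stand-in for the scheduling layer.

    A real system would query an LLM here; this stub uses a fixed
    lookup table so the example runs offline.
    """
    plans = {
        "put the can in the drawer": [
            Subtask("move_to", "can"),
            Subtask("grasp", "can"),
            Subtask("move_to", "drawer"),
            Subtask("release", "can"),
        ],
    }
    if task not in plans:
        raise ValueError(f"no plan for task: {task!r}")
    return plans[task]


# Stand-in for the skill layer: each meta-skill is a reusable callable
# that turns a target into a hardware command.
SkillFn = Callable[[str, MockHardware], None]

SKILLS: Dict[str, SkillFn] = {
    "move_to": lambda target, hw: hw.send(f"move_to({target})"),
    "grasp": lambda target, hw: hw.send(f"grasp({target})"),
    "release": lambda target, hw: hw.send(f"release({target})"),
}


def execute(task: str, hw: MockHardware) -> List[str]:
    """Decompose a task, then run each subtask through its meta-skill."""
    for sub in decompose(task):
        SKILLS[sub.skill](sub.target, hw)
    return hw.log


hw = MockHardware()
trace = execute("put the can in the drawer", hw)
print(trace)
```

In this sketch a new task reuses the existing entries in `SKILLS` and only needs a new plan from the scheduling stub, which mirrors the idea of completing unseen tasks through skill composition. Because each subtask maps to one named skill, a failure can be traced to a specific step.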
