Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao

TL;DR

Manual2Skill enables robots to read and act on abstract instruction manuals by combining Vision-Language Models with hierarchical assembly graphs, per-step pose estimation, and motion planning. It reconstructs task structure from manual images, predicts step-wise 6D poses, and plans and executes assembly with planners such as RRT-Connect, demonstrated on real IKEA furniture and extended to other assembly tasks. Key contributions include a two-stage VLM-based graph generation (Stage I and II), a sequential per-step pose estimation dataset with multimodal fusion, and comprehensive real-world and simulated evaluations showing robust performance. This work advances robot learning from human manuals, reducing data requirements and enabling long-horizon manipulation in practical settings.

Abstract

Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.

Project Page: https://owensun2004.github.io/Furniture-Assembly-Web/
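To make the idea of a hierarchical assembly graph concrete, here is a minimal sketch, not the authors' implementation. It assumes the graph is a tree whose leaves are atomic parts and whose internal nodes are subassemblies; the class `AssemblyNode`, the function `assembly_steps`, and the stool example are all hypothetical names chosen for illustration. A post-order traversal then yields one plausible assembly sequence, since every subassembly is listed only after the pieces it is built from.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AssemblyNode:
    """Hypothetical graph node: a leaf is an atomic part, an internal node is a subassembly."""
    name: str
    children: List["AssemblyNode"] = field(default_factory=list)

    @property
    def is_part(self) -> bool:
        return not self.children


def assembly_steps(root: AssemblyNode) -> List[Tuple[str, List[str]]]:
    """Post-order traversal: each subassembly appears after all of its children."""
    steps: List[Tuple[str, List[str]]] = []

    def visit(node: AssemblyNode) -> None:
        for child in node.children:
            visit(child)
        if not node.is_part:
            # One assembly step: join the children into this subassembly.
            steps.append((node.name, [c.name for c in node.children]))

    visit(root)
    return steps


# Illustrative furniture item (assumed, not taken from the paper).
legs = [AssemblyNode(f"leg_{i}") for i in (1, 2)]
frame = AssemblyNode("frame", legs + [AssemblyNode("crossbar")])
stool = AssemblyNode("stool", [frame, AssemblyNode("seat")])

steps = assembly_steps(stool)
for index, (target, inputs) in enumerate(steps, start=1):
    print(f"step {index}: join {inputs} -> {target}")
```

In a pipeline like the one described, each such step would be the unit at which relative poses are predicted and motions are planned; how the paper actually encodes and orders its graph may differ from this sketch.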

Paper Structure

This paper contains 47 sections, 18 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Framework Overview. (1) GPT-4o (Achiam et al., 2023) is queried with manual pages to generate a sequential assembly plan, represented as a hierarchical assembly graph. (2) The furniture components’ point clouds and corresponding manual images are processed by a pose estimation module to predict target poses for each component. (3) The system sequentially executes the assembly by planning and performing robotic actions based on the hierarchical assembly graph and estimated poses.
  • Figure 2: Qualitative results. Our method significantly outperforms the baselines. SingleStep fails on moderately complex furniture, while GeoCluster generates physically impossible subassemblies (highlighted in red). In contrast, our approach closely aligns with the ground truth.
  • Figure 3: Pre-Assembly Scene Variations. (Left) original pre-assembly scene. (Middle) parts randomly shuffled along the ground plane. (Right) parts randomly rotated in-place.
  • Figure 4: Qualitative results on three furniture categories. We observe better pose predictions than baselines.
  • Figure 5: Qualitative Evaluation on real IKEA furniture items. This figure illustrates the assembly process of various IKEA furniture items, including FLISAT, VARIERA, SUNDVIK, and KNAGGLIG, with our approach. For each item, we display the manual images, per-step 3D parts pose estimation results, and real-world assembly outcomes.
  • ...and 8 more figures
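The Figure 1 caption mentions predicting target poses for each component. As a rough illustration of what a relative 6D pose means, and not the paper's pose estimation model, the sketch below represents each pose as a 4x4 homogeneous transform and computes the pose of one part expressed in another part's frame. For brevity the rotation is restricted to yaw about the z-axis, which is an assumption of this example; the helper names `pose`, `invert`, and `relative_pose` and the part values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def pose(yaw_deg: float, tx: float, ty: float, tz: float) -> Matrix:
    """4x4 rigid transform: rotation about z by yaw_deg, plus a translation."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert(t: Matrix) -> Matrix:
    """Inverse of a rigid transform [R | p] is [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_pose(world_a: Matrix, world_b: Matrix) -> Matrix:
    """Pose of part b expressed in the frame of part a."""
    return matmul(invert(world_a), world_b)


# Illustrative values: both parts share the same orientation in the world frame.
world_seat = pose(90.0, 1.0, 0.0, 0.0)
world_leg = pose(90.0, 1.0, 1.0, 0.0)
seat_to_leg = relative_pose(world_seat, world_leg)

translation = [round(seat_to_leg[i][3], 6) for i in range(3)]
print("leg position in seat frame:", translation)
```

Expressing poses relative to another part, rather than in the world frame, makes the result independent of where the furniture sits in the workspace, which is one reason relative poses are a natural target for per-step prediction.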