VLM-driven Skill Selection for Robotic Assembly Tasks

Jeong-Jung Kim, Doo-Yeol Koh, Chang-Hyun Kim

TL;DR

This work addresses the challenge of long-horizon robotic assembly by integrating Vision-Language Models with imitation learning in a hierarchical, modular framework. A two-stage VLM pipeline first marks scene objects and then reasons about a sequence of primitive skills (pick, place, insert, done, init) that are executed by learned policies, enabling flexible and interpretable assembly. Across simulations and real-world tests, the approach demonstrates that VLM-guided planning combined with imitation learning improves task reliability and scalability, with newer VLM architectures delivering notable gains in spatial reasoning and sim-to-real transfer. The framework supports forward-looking integration of advancing visual reasoning models, offering a scalable path toward autonomous, knowledge-driven industrial assembly.

Abstract

This paper presents a robotic assembly framework that combines Vision-Language Models (VLMs) with imitation learning for assembly manipulation tasks. Our system employs a gripper-equipped robot that moves in 3D space to perform assembly operations. The framework integrates visual perception, natural language understanding, and learned primitive skills to enable flexible and adaptive robotic manipulation. Experimental results demonstrate the effectiveness of our approach in assembly scenarios, achieving high success rates while maintaining interpretability through the structured primitive skill decomposition.
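The iterative loop summarized above (mark scene objects, let the VLM choose one primitive skill, execute it with a learned policy, repeat until done) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the five skill names come from the paper summary. The function names (`mark_objects`, `select_skill`, `observe`, `run_episode`), the "skill target" reply format, and the `max_steps` cap are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Skill(Enum):
    # The five primitive skills named in the summary.
    PICK = "pick"
    PLACE = "place"
    INSERT = "insert"
    DONE = "done"
    INIT = "init"


@dataclass
class Step:
    skill: Skill
    target: str  # label of a marked object; "" when the skill takes none


def parse_vlm_reply(reply: str) -> Step:
    """Parse a reply such as 'pick peg_1' (hypothetical format) into a Step."""
    parts = reply.strip().lower().split(maxsplit=1)
    if not parts:
        raise ValueError("empty VLM reply")
    skill = Skill(parts[0])  # raises ValueError for names outside the five primitives
    target = parts[1] if len(parts) > 1 else ""
    return Step(skill, target)


def run_episode(
    mark_objects: Callable[[object], List[str]],               # stage 1 (assumed interface)
    select_skill: Callable[[object, List[str], List[Step]], str],  # stage 2 (assumed interface)
    policies: Dict[Skill, Callable[[str], None]],              # learned policies, one per skill
    observe: Callable[[], object],                             # returns the current camera image
    max_steps: int = 20,                                       # safety cap, an assumption
) -> List[Step]:
    history: List[Step] = []
    for _ in range(max_steps):
        image = observe()
        labels = mark_objects(image)                 # stage 1: mark scene objects
        reply = select_skill(image, labels, history)  # stage 2: reason about the next skill
        step = parse_vlm_reply(reply)
        history.append(step)
        if step.skill is Skill.DONE:
            break
        policies[step.skill](step.target)            # execute the chosen primitive
    return history
```

Restricting the VLM output to a closed set of skill names keeps every decision inspectable and lets malformed replies fail early, which is one way to read the interpretability claim in the abstract.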

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: VLM-driven robotic assembly framework showing the iterative process from visual input through two-stage VLM processing to skill execution.
  • Figure 2: Prompt architecture integrating task description, state analysis, and action specification.
  • Figure 3: VLM-based primitive selection results for simulation environment.
  • Figure 4: VLM-based primitive selection results for real environment 1.
  • Figure 5: VLM-based primitive selection results for real environment 2.
  • ...and 1 more figure