VLM-driven Skill Selection for Robotic Assembly Tasks
Jeong-Jung Kim, Doo-Yeol Koh, Chang-Hyun Kim
TL;DR
This work addresses long-horizon robotic assembly by integrating Vision-Language Models (VLMs) with imitation learning in a hierarchical, modular framework. A two-stage VLM pipeline first marks the objects in the scene and then reasons about a sequence of primitive skills (pick, place, insert, done, init) that are executed by learned policies, which keeps the assembly process flexible and interpretable. In simulation and real-world tests, VLM-guided planning combined with imitation learning improves task reliability and scalability, and newer VLM architectures deliver notable gains in spatial reasoning and sim-to-real transfer. Because the framework is modular, more capable visual reasoning models can be integrated as they become available, offering a scalable path toward autonomous, knowledge-driven industrial assembly.
Abstract
This paper presents a robotic assembly framework that combines Vision-Language Models (VLMs) with imitation learning for assembly manipulation tasks. Our system employs a gripper-equipped robot that moves in 3D space to perform assembly operations. The framework integrates visual perception, natural language understanding, and learned primitive skills to enable flexible and adaptive robotic manipulation. Experimental results demonstrate the effectiveness of our approach in assembly scenarios, achieving high success rates while maintaining interpretability through the structured primitive skill decomposition.
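To make the structured primitive skill decomposition concrete, the following is a minimal sketch of how a two-stage VLM loop could dispatch to learned skill policies. It is an illustration under stated assumptions, not the authors' implementation: the callables `mark_objects`, `select_skill`, and `observe`, the reply format (`"pick 1"`, `"done"`), and the `max_steps` limit are all hypothetical. Only the five skill names come from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Optional


class Skill(Enum):
    # The five primitive skills named in the paper.
    PICK = "pick"
    PLACE = "place"
    INSERT = "insert"
    DONE = "done"
    INIT = "init"


@dataclass
class SkillCall:
    skill: Skill
    target: Optional[str] = None  # label of a marked object, if any


def parse_skill_call(reply: str) -> SkillCall:
    """Parse a VLM reply such as 'pick 1' or 'done' (assumed format)."""
    parts = reply.strip().lower().split()
    if not parts:
        raise ValueError("empty VLM reply")
    skill = Skill(parts[0])  # raises ValueError for an unknown skill name
    target = parts[1] if len(parts) > 1 else None
    return SkillCall(skill, target)


def run_episode(
    mark_objects: Callable[[Any], List[str]],
    select_skill: Callable[[Any, List[str], List[SkillCall]], str],
    policies: Dict[Skill, Callable[[Optional[str]], None]],
    observe: Callable[[], Any],
    max_steps: int = 20,  # hypothetical safety limit
) -> List[SkillCall]:
    """Alternate VLM reasoning and skill execution until 'done'."""
    history: List[SkillCall] = []
    for _ in range(max_steps):
        image = observe()
        marks = mark_objects(image)                   # stage 1: mark scene objects
        reply = select_skill(image, marks, history)   # stage 2: choose next skill
        call = parse_skill_call(reply)
        history.append(call)
        if call.skill is Skill.DONE:
            break
        policies[call.skill](call.target)             # run the learned policy
    return history
```

Restricting the VLM output to a small, fixed vocabulary keeps every decision inspectable: the returned history is a readable trace of the skills chosen, and any reply outside the vocabulary fails at the parsing step rather than reaching the robot.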
