Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models

Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson, Stefanie Pender, Sachin Chitta, Yotto Koga, Faez Ahmed, Lawrence Sass, Randall Davis

TL;DR

The paper addresses the challenge of turning text prompts into physically assemblable multi-component objects by integrating 3D generative AI with vision-language models (VLMs) and robotics. It introduces a zero-shot VLM reasoning pipeline, aware of both function and geometry, that decomposes AI-generated meshes into predefined structural and panel components and maps them to robot-graspable faces, with a human in the loop who steers the result through conversational feedback. The end-to-end framework exports a coordinate list and a component-type list for robotic placement and demonstrates multi-component assembly on a UR20 robot. In the evaluation, users preferred the VLM-driven panel assignments 90.6% of the time, compared with 59.4% for rule-based and 2.5% for random assignment, and they could refine assignments through natural language. The work advances human-AI co-creation for physical fabrication by combining natural language prompts, multi-modal reasoning, and robotic execution. It also notes limitations (a fixed component library, simple prompts) and outlines directions for expanding component types and interactive control.

Abstract

Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on the object's geometry and functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
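
As a rough illustration of the pipeline's final hand-off, the TL;DR mentions that the framework exports a coordinate list and a component-type list for robotic placement. The sketch below shows one way such a pair of parallel lists could be produced. It is not the authors' implementation: the `Component` class, the `export_assembly` function, the two type labels, and the bottom-up, structure-first ordering are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float, float]


@dataclass(frozen=True)
class Component:
    """Hypothetical record for one placed part (not from the paper)."""
    kind: str             # assumed labels: "structural" or "panel"
    position: Coordinate  # assumed (x, y, z) placement coordinate


def export_assembly(
    components: List[Component],
) -> Tuple[List[Coordinate], List[str]]:
    """Return parallel lists: placement coordinates and component types.

    Assumption: parts are ordered bottom-up by z, and structural parts
    come before panels at the same height, so supports are placed first.
    """
    ordered = sorted(
        components,
        key=lambda c: (c.position[2], c.kind != "structural"),
    )
    coords = [c.position for c in ordered]
    kinds = [c.kind for c in ordered]
    return coords, kinds


# Example: two structural parts and one panel sitting at z = 1.
parts = [
    Component("panel", (0.0, 0.0, 1.0)),
    Component("structural", (0.0, 0.0, 0.0)),
    Component("structural", (1.0, 0.0, 1.0)),
]
coords, kinds = export_assembly(parts)
```

Keeping the two lists parallel, so that index `i` in one matches index `i` in the other, lets a downstream robot program iterate over them together. The actual export format used in the paper may differ.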

Paper Structure

This paper contains 12 sections, 2 equations, 14 figures, 4 tables, and 1 algorithm.

Figures (14)

  • Figure 1: From text input to multi-component robotic assembly using predetermined components
  • Figure 2: System Pipeline: Vision Language Model for Function and Geometry Aware Part Selection
  • Figure 3: User Alignment: Integrating human feedback with geometry-aware VLM part assignment
  • Figure 4: Text to multi-component robotic assembly of the user prompt: "Make me a chair", with the user feedback: "I want panels on the seat".
  • Figure 5: Multi-component assemblies of five different objects created using three different approaches: random, rule-based, and vision-language models
  • ...and 9 more figures