Blox-Net: Generative Design-for-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset

Andrew Goldberg; Kavish Kondap; Tianshuang Qiu; Zehan Ma; Letian Fu; Justin Kerr; Huang Huang; Kaiyuan Chen; Kuan Fang; Ken Goldberg

TL;DR

Blox-Net is a Generative Design-for-Robot-Assembly (GDfRA) system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems with minimal human supervision.

Abstract

Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the rich history of research in industrial "Design for Assembly", we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., "giraffe") and an image of available physical components, such as 3D-printed blocks. The output is an assembly (a spatial arrangement of these components) together with instructions for a robot to build it. The output must 1) resemble the requested object and 2) be reliably assembled by a 6 DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems with minimal human supervision. Blox-Net achieved a Top-1 accuracy of 63.5% in the "recognizability" of its designed assemblies (e.g., resembling a giraffe as judged by a VLM). These designs, after automated perturbation redesign, were reliably assembled by a robot, achieving near-perfect success across 10 consecutive assembly iterations with human intervention only during reset prior to assembly. Surprisingly, this entire design process from textual word ("giraffe") to reliable physical assembly is performed with zero human intervention.
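As a rough illustration of the Top-1 "recognizability" metric mentioned in the abstract, the sketch below computes the fraction of designs for which a judge's top-ranked guess matches the prompt word. The function name `top1_accuracy`, the case-insensitive matching rule, and the example labels and guesses are assumptions made for illustration only; they are not taken from the paper and do not reproduce its evaluation protocol or its 63.5% result.

```python
from typing import Sequence


def top1_accuracy(labels: Sequence[str], guesses: Sequence[str]) -> float:
    """Return the fraction of designs whose top-ranked guess equals the prompt label.

    Matching is case-insensitive and ignores surrounding whitespace
    (an assumption for this sketch, not the paper's stated rule).
    """
    if len(labels) != len(guesses):
        raise ValueError("labels and guesses must have the same length")
    if not labels:
        raise ValueError("need at least one design")
    hits = sum(
        label.strip().lower() == guess.strip().lower()
        for label, guess in zip(labels, guesses)
    )
    return hits / len(labels)


# Hypothetical data: prompt words and the judge's top guess for each design.
labels = ["giraffe", "table", "bridge", "tower"]
guesses = ["giraffe", "chair", "bridge", "tower"]
print(top1_accuracy(labels, guesses))  # 0.75
```

In a real evaluation the guesses would come from a VLM shown a rendering or photo of each assembly; here they are hard-coded so the example runs on its own.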

Paper Structure (17 sections, 7 figures, 3 tables)

Introduction
Related Work
Design for Robot Assembly
Text-to-Shape Generation
Robot Task Planning with Foundation Models
GDfRA Problem
Method
Phase I: VLM Design and Selection
Phase II: Perturbation-Based Redesign
Phase III: Robot Assembly and Evaluation
Experiments
Semantic Recognizability
Constructability
Perturbation Redesign Ablation
Implementation Details
...and 2 more sections

Figures (7)

Figure 1: Can a vision-language model generate designs suitable for robot assembly? Blox-Net is a GDfRA system that produces 3D designs constructible by robots subject to physical material constraints. (a) Starting with a phrase (e.g., "giraffe") and a set of blocks, (b) Blox-Net iteratively prompts GPT-4o to generate designs, using simulation to verify stability. (c) A physical robot then assembles the design to test stability and constructability, (d) resulting in the successful assembly of the design.
Figure 2: Overview of Blox-Net. We present a multi-stage framework for producing physically constructible models based on a user-specified prompt. The Blox-Net pipeline begins with a natural language input and JSON detailing the available blocks. These parameters are passed into a series of VLM prompts, beginning with a high-level overview (Describe), followed by requesting specific blocks to use in construction (Plan) and a sequence to place them in (Order). Finally, the VLM generates the initial design and enters a feedback loop, continuously receiving visual and stability feedback from the simulator. After generating 10 candidate designs, a separate VLM selects the best structure through head-to-head image comparisons. The perturbation redesign phase then adjusts the selected structure to enhance its physical constructability before it is assembled by a real robot.
Figure 3: Diverse Design Generations: Left: Blox-Net generates a diverse set of candidate designs and uses the VLM (GPT-4o) to select the most suitable one. Right: Blox-Net accurately generates a variety of structural designs, adhering to specific input constraints. A set of 10 designs can be generated in 81 seconds, and the selection of the best design takes an additional 60 seconds.
Figure 4: Block Reorientation: The robot first places the block into a 90 degree angle bracket. Then, the block is regrasped on a different face, achieving a 90 degree rotation.
Figure 5: Task Execution: We present Blox-Net VLM-generated designs assembled by a robot, paired with simulation renderings.
...and 2 more figures
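
The design loop described in the Figure 2 caption (propose a design, receive stability feedback from a simulator, then pick the best candidate through head-to-head comparisons) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Blox-Net's implementation: the VLM and the physics simulator are replaced by toy stand-in functions, a "design" is reduced to a list of block widths from bottom to top, and every name here (`refine_design`, `select_best`, `toy_simulate`, `make_toy_proposer`, `toy_prefer`) is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Toy stand-in for a design: block widths listed from bottom to top.
Design = List[int]
Proposer = Callable[[Optional[str]], Design]
Simulator = Callable[[Design], Tuple[bool, Optional[str]]]


def refine_design(propose: Proposer, simulate: Simulator,
                  max_rounds: int = 3) -> Optional[Design]:
    """Ask for designs until the simulator reports one as stable, or give up."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        design = propose(feedback)           # stands in for a VLM call
        stable, feedback = simulate(design)  # stands in for physics simulation
        if stable:
            return design
    return None


def select_best(candidates: List[Design],
                prefer: Callable[[Design, Design], Design]) -> Design:
    """Keep a running winner across pairwise (head-to-head) comparisons."""
    if not candidates:
        raise ValueError("no candidates to compare")
    best = candidates[0]
    for challenger in candidates[1:]:
        best = prefer(best, challenger)
    return best


def toy_simulate(design: Design) -> Tuple[bool, Optional[str]]:
    # Toy rule: stable if no block is wider than the block beneath it.
    stable = all(lower >= upper for lower, upper in zip(design, design[1:]))
    return stable, None if stable else "a wider block rests on a narrower one"


def make_toy_proposer(blocks: List[int]) -> Proposer:
    def propose(feedback: Optional[str]) -> Design:
        # First attempt keeps the given order; after feedback, stack widest-first.
        return list(blocks) if feedback is None else sorted(blocks, reverse=True)
    return propose


def toy_prefer(a: Design, b: Design) -> Design:
    # Toy judge: prefer the design that uses more blocks (ties keep the incumbent).
    return a if len(a) >= len(b) else b


block_sets = [[1, 3, 2], [4, 2], [2, 2, 5, 1]]
candidates = []
for blocks in block_sets:
    design = refine_design(make_toy_proposer(blocks), toy_simulate)
    if design is not None:
        candidates.append(design)

best = select_best(candidates, toy_prefer)
print(candidates)  # [[3, 2, 1], [4, 2], [5, 2, 2, 1]]
print(best)        # [5, 2, 2, 1]
```

Passing the proposer, simulator, and judge in as plain callables keeps the control flow separate from the models behind it, so the same loop could in principle be driven by real VLM and simulator calls.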
