StackGen: Generating Stable Structures from Silhouettes via Diffusion

Luzhe Sun; Takuma Yoneda; Samuel W. Wheeler; Tianchong Jiang; Matthew R. Walter

TL;DR

StackGen constructs stable 3D block structures from 2D silhouettes by learning a conditional diffusion model over SE(3) block poses. A Transformer-based denoiser, conditioned on the target silhouette and a CNN-predicted block list, generates diverse, physically feasible block configurations without explicitly modeling forward dynamics. The model is trained on a large synthetic dataset created via construction by deconstruction and validated in simulation and in real-world UR5 robot experiments, where it outperforms several baselines in stability and silhouette fidelity while maintaining diversity. The resulting data-driven pipeline turns a simple visual prompt into a buildable structure, connecting visual design to physical assembly.

Abstract

Humans naturally obtain intuition about the interactions between and the stability of rigid objects by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics, both of which are difficult to scale and preclude generalization. Instead, robots would benefit from an awareness of intuitive physics that enables them to similarly reason over the stable interaction of objects in their environment. Towards that goal, we propose StackGen, a diffusion model that generates diverse stable configurations of building blocks matching a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in a real-world setting using a robotic arm to assemble structures generated by the model.

Paper Structure (20 sections, 6 equations, 7 figures, 1 table)

Introduction
Related Work
Learning Stability from Intuitive Physics
Diffusion Models for Pose Generation
Automated Sequential Assembly
Method
Diffusion Models for SE(3) Block Pose Generation
Model Architecture
Generating Data
Experiments
Evaluation in simulation
Brute-Force Baseline
Greedy-Random Baseline
Transformer-Regression Baseline
Transformer-VAE Baseline
...and 5 more sections

Figures (7)

Figure 1: StackGen consists of a diffusion model that takes as inputs a target structure silhouette and a list of available block shapes. The model then generates a set of block poses $\{\hat{\bm{p}}_{1}, \ldots, \hat{\bm{p}}_k\}$ that construct a stable structure consistent with the target silhouette. The resulting structure can then be constructed using a robot arm.
Figure 2: A visualization of StackGen's Transformer-based architecture. We first embed the noisy poses $\tilde{\bm{p}}_{1:k}^t$, shapes $\bm{s}_{1:k}$, and timestep $t$ to the same dimension $d$, then sum them to construct object tokens in $\mathbb{R}^{k\times d}$. We then patchify the silhouette, embed the patches as $\bm{i}_{1:c}$, and add the positional encoding $\textrm{Pos}_{1:c}$ to construct silhouette tokens in $\mathbb{R}^{c\times d}$. The object, padding, and silhouette tokens are fed together to the denoising Transformer $D_\theta$ (Eq. \ref{eq:denoiser}), which predicts the noise $\hat{\bm{\epsilon}}_{1:k}^t$. Applying Eq. \ref{eq:one-step-denoise} then yields the next reverse state $\tilde{\bm{p}}_{1:k}^{t-1}$.
Figure 3: Our strategy (left) for generating a diverse set of stable stacks. After filling the design grid with shapes, we verify the stack's stability in the simulator, then remove blocks one at a time, saving the stacks that remain stable. Right: some challenging examples from the dataset.
Figure 4: A (left) reference (i.e., ground-truth) stack with its silhouette, and (right) a diverse set of structures generated from the silhouette by our model.
Figure 5: Our learning-based baselines include (left) Tran-VAE: a Transformer-based VAE model with a two-layer encoder and four-layer decoder, and (right) Tran-Reg: a Transformer-based regression model.
...and 2 more figures
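
The token construction described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 7-dimensional pose vector, the lookup-table shape and timestep embeddings, the patch size, and all names (`build_tokens`, `ddpm_reverse_step`, `params`) are hypothetical. Padding tokens and the Transformer $D_\theta$ itself are omitted, and the reverse step uses the standard DDPM update as a stand-in, since the paper's one-step denoising equation is not reproduced on this page.

```python
import numpy as np

def build_tokens(noisy_poses, shape_ids, t, silhouette, params, patch=8):
    """Assemble object tokens (k, d) and silhouette tokens (c, d)."""
    # Object tokens: pose, shape, and timestep embeddings share dimension d
    # and are summed; the timestep embedding broadcasts over the k blocks.
    pose_emb = noisy_poses @ params["W_pose"]       # (k, d)
    shape_emb = params["E_shape"][shape_ids]        # (k, d)
    time_emb = params["E_time"][t]                  # (d,)
    object_tokens = pose_emb + shape_emb + time_emb

    # Silhouette tokens: split the image into non-overlapping patches,
    # embed each flattened patch linearly, then add a positional encoding.
    h, w = silhouette.shape
    patches = (silhouette.reshape(h // patch, patch, w // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))         # (c, patch * patch)
    silhouette_tokens = patches @ params["W_patch"] + params["pos"][: len(patches)]
    return object_tokens, silhouette_tokens

def ddpm_reverse_step(x_t, eps_hat, t, betas, rng):
    """Standard DDPM reverse update (assumed here, with variance beta_t)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                                 # no noise on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
k, pose_dim, d, n_shapes, T, patch = 4, 7, 32, 5, 100, 8
params = {
    "W_pose": rng.standard_normal((pose_dim, d)) * 0.02,
    "E_shape": rng.standard_normal((n_shapes, d)) * 0.02,
    "E_time": rng.standard_normal((T, d)) * 0.02,
    "W_patch": rng.standard_normal((patch * patch, d)) * 0.02,
    "pos": rng.standard_normal((16, d)) * 0.02,
}
silhouette = (rng.random((32, 32)) > 0.5).astype(np.float64)
poses = rng.standard_normal((k, pose_dim))
shape_ids = np.array([0, 2, 2, 4])
betas = np.linspace(1e-4, 0.02, T)

obj_tokens, sil_tokens = build_tokens(poses, shape_ids, 50, silhouette, params, patch)
eps_hat = np.zeros_like(poses)  # placeholder for the denoiser's noise prediction
prev_poses = ddpm_reverse_step(poses, eps_hat, 50, betas, rng)
```

In a full model, the two token sets (plus padding tokens for unused block slots) would be concatenated and passed through the Transformer, whose output at the object-token positions would replace the placeholder `eps_hat`.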
