StackGen: Generating Stable Structures from Silhouettes via Diffusion
Luzhe Sun, Takuma Yoneda, Samuel W. Wheeler, Tianchong Jiang, Matthew R. Walter
TL;DR
StackGen constructs stable 3D block structures from 2D silhouettes by learning a conditional diffusion model over SE(3) block poses. A Transformer-based denoiser, conditioned on the target silhouette and on a CNN-predicted block list, generates diverse, physically feasible block configurations without explicitly modeling forward dynamics. The model is trained on a large synthetic dataset built via construction by deconstruction and is validated in simulation and in real-world experiments with a UR5 robot, where it outperforms several baselines in stability and silhouette fidelity while maintaining diversity. The resulting data-driven pipeline synthesizes structures from simple visual prompts, bridging visual design and physical assembly.
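To make the sampling idea concrete, the following is a minimal sketch of standard DDPM ancestral sampling, not the authors' implementation. It makes several loud assumptions: poses are reduced to flattened block translations (no rotations, so not full SE(3)), the noise schedule is a generic linear one, and `oracle_denoiser` is a hypothetical stand-in for the learned silhouette-conditioned Transformer that simply knows the clean answer. All function names here are illustrative.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the cumulative products used by DDPM."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def oracle_denoiser(target):
    """Hypothetical stand-in for a learned, silhouette-conditioned network.

    It already knows the clean sample `target`, so it returns the exact noise
    implied by x_t. A real model would predict this from (x_t, t, silhouette).
    """
    def predict_noise(x_t, alpha_bar_t):
        scale = math.sqrt(alpha_bar_t)
        denom = math.sqrt(1.0 - alpha_bar_t)
        return [(x - scale * x0) / denom for x, x0 in zip(x_t, target)]
    return predict_noise


def sample(predict_noise, dim, num_steps=50, seed=0):
    """Run the DDPM reverse process from Gaussian noise down to step 0."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


# Two blocks stacked vertically, as flattened (x, y, z) translations in metres.
target = [0.0, 0.0, 0.02, 0.0, 0.0, 0.06]
result = sample(oracle_denoiser(target), dim=len(target))
```

With the oracle in place, the final denoising step recovers `target` up to floating-point error, which makes the loop easy to sanity-check. Replacing the oracle with a trained network is where silhouette conditioning and diversity across samples would come from.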
Abstract
Humans naturally develop an intuition for how rigid objects interact and when they are stable by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics. Such models are difficult to scale and preclude generalization. Robots would instead benefit from an awareness of intuitive physics that enables them to reason similarly about the stable interaction of objects in their environment. Towards that goal, we propose StackGen, a diffusion model that generates diverse, stable configurations of building blocks that match a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in a real-world setting, using a robotic arm to assemble structures generated by the model.
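One way to picture what "matching a target silhouette" could mean is to rasterize a candidate structure's front view and compare it with the target mask. The sketch below uses intersection-over-union (IoU) on a coarse grid. This is an illustrative assumption, not necessarily the metric used in the paper, and the block representation `(x, z, w, h)` in integer grid cells as well as the function names are hypothetical.

```python
def rasterize(blocks, width, height):
    """Return a height x width grid of 0/1 cells covered by any block.

    Each block is (x, z, w, h) in integer grid cells: the lower-left corner and
    size of its front-view footprint. Row 0 is the ground row. Parts of a block
    that fall outside the grid are clipped.
    """
    grid = [[0] * width for _ in range(height)]
    for x, z, w, h in blocks:
        for row in range(max(z, 0), min(z + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                grid[row][col] = 1
    return grid


def iou(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    return inter / union if union else 1.0


# Target: a small arch made of two pillars and a lintel across the top.
target_blocks = [(0, 0, 1, 2), (2, 0, 1, 2), (0, 2, 3, 1)]
# Candidate: same pillars, but the lintel is shifted one cell to the right.
candidate_blocks = [(0, 0, 1, 2), (2, 0, 1, 2), (1, 2, 3, 1)]

target_mask = rasterize(target_blocks, width=3, height=3)
candidate_mask = rasterize(candidate_blocks, width=3, height=3)
score = iou(target_mask, candidate_mask)  # 6 shared cells out of 7 in the union
```

A score of 1.0 means the projected candidate covers exactly the target silhouette. Note that a silhouette score alone says nothing about stability, which would have to be checked separately, for example in a physics simulator.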
