"Set It Up!": Functional Object Arrangement with Compositional Generative Models
Yiqing Xu, Jiayuan Mao, Yilun Du, Tomás Lozano-Pérez, Leslie Pack Kaelbling, David Hsu
TL;DR
The paper introduces SetItUp, a neuro-symbolic framework for generating functional object arrangements from under-specified instructions. It combines an LLM-based generator of abstract spatial relations with a library of compositional diffusion models that ground those relations into concrete object poses, organized via a ground factor graph. Using a small, human-designed program sketch and five few-shot examples per scene type, SetItUp generalizes to unseen objects and instructions across study desks, dining tables, and coffee tables, and outperforms end-to-end diffusion and LLM-only baselines in physical feasibility, functionality, and aesthetics. The approach offers data-efficient, scalable grounding for complex tabletop arrangements, with potential for continual learning and extension to broader task families.
Abstract
This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.
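The two-stage decomposition described above can be illustrated with a toy sketch. It is not the authors' implementation: the relation names, the hard-coded relation list (standing in for the LLM's proposal), and the hand-written quadratic costs minimized by gradient descent (standing in for the composed diffusion models) are all assumptions made for illustration. What it does show is the structure: each abstract relation contributes one factor over object positions, and poses are found by jointly satisfying the sum of factors.

```python
import random

# Hypothetical relation library. In SetItUp each abstract relation is backed by
# a learned diffusion model; simple quadratic costs stand in for them here.
def left_of(a, b, gap=0.2):
    """Zero when a sits exactly `gap` to the left of b along x."""
    return (b[0] - a[0] - gap) ** 2

def aligned_horizontally(a, b):
    """Zero when a and b share the same y coordinate."""
    return (a[1] - b[1]) ** 2

def centered(a, center=(0.5, 0.5)):
    """Zero when a sits at the table center (unit-square table assumed)."""
    return (a[0] - center[0]) ** 2 + (a[1] - center[1]) ** 2

RELATIONS = {
    "left_of": left_of,
    "aligned_horizontally": aligned_horizontally,
    "centered": centered,
}

# Stage 1 stand-in: relations an LLM might propose for one place setting.
OBJECTS = ["fork", "plate", "knife"]
FACTORS = [
    ("centered", ("plate",)),
    ("left_of", ("fork", "plate")),
    ("left_of", ("plate", "knife")),
    ("aligned_horizontally", ("fork", "plate")),
    ("aligned_horizontally", ("plate", "knife")),
]

def total_cost(poses, factors):
    """Compose the factors by summing their costs."""
    return sum(RELATIONS[name](*(poses[o] for o in args)) for name, args in factors)

def ground_relations(objects, factors, steps=2000, lr=0.05, eps=1e-5, seed=0):
    """Stage 2 stand-in: gradient descent on the composed cost,
    using central-difference numerical gradients."""
    rng = random.Random(seed)
    poses = {o: [rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)] for o in objects}
    for _ in range(steps):
        for o in objects:
            for i in range(2):
                base = poses[o][i]
                poses[o][i] = base + eps
                up = total_cost(poses, factors)
                poses[o][i] = base - eps
                down = total_cost(poses, factors)
                poses[o][i] = base - lr * (up - down) / (2 * eps)
    return poses

if __name__ == "__main__":
    result = ground_relations(OBJECTS, FACTORS)
    for name in OBJECTS:
        print(name, [round(v, 3) for v in result[name]])
```

Under these stand-in costs the only zero-cost configuration places the plate at the table center with the fork and knife 0.2 units to its left and right on the same horizontal line. The point of the design is that adding an object or a relation only adds factors to the list; nothing about the solver changes, which mirrors how the paper composes per-relation models rather than training one model per scene.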
