"Set It Up!": Functional Object Arrangement with Compositional Generative Models

Yiqing Xu, Jiayuan Mao, Yilun Du, Tomás Lozano-Pérez, Leslie Pack Kaelbling, David Hsu

TL;DR

The paper introduces SetItUp, a neuro-symbolic framework for functional object arrangement from under-specified instructions. An LLM-powered generator proposes abstract spatial relations among the objects, organized as a ground factor graph, and a library of composable diffusion models grounds those relations into concrete object poses. Each scene type is specified by a small, human-designed program sketch and five training examples. SetItUp generalizes to unseen objects and instructions across study desks, dining tables, and coffee tables, and it outperforms end-to-end diffusion and LLM-only baselines in physical feasibility, functionality, and aesthetics. The approach offers data-efficient grounding for complex tabletop arrangements, with potential extensions to continual learning and broader task families.

Abstract

This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.
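
The grounding step described above, where several relation-specific models are composed so that shared objects satisfy all constraints at once, can be illustrated with a toy sketch. This is not the paper's implementation: the hand-written penalty gradients `left_of` and `aligned`, the 2D point poses, and the `ground` sampler with its step size and noise schedule are all assumptions standing in for learned diffusion models. The only idea it tries to show is that each factor contributes a gradient for its own objects and the contributions are summed per object.

```python
import random

# Hand-written penalty gradients stand in for learned, relation-specific
# diffusion models (an assumption made purely for illustration).
def left_of(a, b, margin=0.2):
    """Gradient of max(0, a.x - b.x + margin)**2 with respect to poses a and b."""
    v = max(0.0, a[0] - b[0] + margin)
    return (2 * v, 0.0), (-2 * v, 0.0)

def aligned(a, b):
    """Gradient of (a.y - b.y)**2 with respect to poses a and b."""
    d = a[1] - b[1]
    return (0.0, 2 * d), (0.0, -2 * d)

RELATION_GRADS = {"left_of": left_of, "aligned": aligned}

def ground(objects, factors, steps=500, lr=0.05, noise=0.05, seed=0):
    """Sample 2D poses by following the summed gradients of all factors."""
    rng = random.Random(seed)
    poses = {o: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for o in objects}
    for t in range(steps):
        grads = {o: [0.0, 0.0] for o in objects}
        # Composition: every factor adds its gradient to the objects it touches.
        for name, a, b in factors:
            ga, gb = RELATION_GRADS[name](poses[a], poses[b])
            for i in range(2):
                grads[a][i] += ga[i]
                grads[b][i] += gb[i]
        # Noise is annealed linearly to zero, loosely mimicking a denoising schedule.
        scale = noise * (1 - (t + 1) / steps)
        for o in objects:
            for i in range(2):
                poses[o][i] += -lr * grads[o][i] + scale * rng.gauss(0, 1)
    return poses

# Hypothetical factor graph: fork left of plate, plate left of knife, all in a row.
factors = [("left_of", "fork", "plate"), ("left_of", "plate", "knife"),
           ("aligned", "fork", "plate"), ("aligned", "plate", "knife")]
poses = ground(["fork", "plate", "knife"], factors)
```

Here the plate appears in all four factors, so its update is the sum of four gradients; adding or removing a relation only changes the factor list, not the sampler.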

Paper Structure (36 sections, 8 equations, 10 figures, 16 tables)

This paper contains 36 sections, 8 equations, 10 figures, 16 tables.

Figures (10)

  • Figure 1: (a) At test time, given a human instruction and a set of objects (possibly unseen during training), our framework SetItUp first generates a set of multi-ary spatial relationships among subsets of objects. These spatial relationships are based on a library of abstract spatial relationships and are visualized in the multi-ary graphical representation. (b) Then, we employ a compositional diffusion model to generate concrete object poses that a robot can execute based on a general motion planner.
  • Figure 2: Overall architecture of SetItUp. Given a novel instruction $\textit{desc}$ and a set of objects $\mathcal{O}$, we first query an LLM to induce an abstract spatial relationship description of the target object arrangements. The input to the LLM also includes a handful of training examples ${\mathcal{D}}$ and a human-defined task-family sketch. Next, we ground these abstract relationships into object poses by composing a library of diffusion models to generate object poses that simultaneously comply with all proposed spatial relationships.
  • Figure 3: Training a single constraint diffusion model involves a two-stage process. First, for every abstract relationship listed in Table \ref{tab:relationships}, we generate a synthetic dataset based on predefined rules. Then, we train a relation-specific diffusion model that can draw samples of object poses that satisfy the relationship.
  • Figure 4: Abstract relationship generation through rule induction involves two phases. Initially, in the program induction phase, we employ an LLM to create a "setup" rule-based program from a few training examples and a high-level task-family sketch defined by humans. This program contains rules and patterns for various subproblems, but it has unbound variables (i.e., the actual objects and instructions are not specified yet). In the second phase, with a new instruction and a list of test objects, the LLM binds these variables to the induced program to create an executable Python program. This executable Python program is then used to generate the final set of abstract spatial relationships as a ground graph.
  • Figure 5: Example process of using an LLM to instantiate a program sketch. Sub-figure (a) presents an example of the initial program sketch. We provide this program sketch, along with five training instances, to the LLM. The LLM then creates a rule-based program, summarizing the common patterns in the form of code comments and/or templates, but with unbound variables, as illustrated in (b). Finally, given new objects and instructions, the LLM binds these variables to the induced program and generates an executable Python program. This program is then used to generate the object grounding graph. An example of an executable Python program with variable bindings during inference time is depicted in (c).
  • ...and 5 more figures
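
As a rough illustration of the last stage described in the captions for Figures 4 and 5, the sketch below shows what an executable rule program might look like once its variables are bound to concrete objects. Everything in it is hypothetical: the `Relation` class, the relation names (`near_table_edge`, `left_of`, `right_of`, `regular_spacing`), the object naming scheme, and the `setup_dining_table` function are illustrative assumptions, not the paper's relationship library or its LLM-generated code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    name: str       # hypothetical relation label, not taken from the paper
    objects: tuple  # objects constrained by this relation (may be multi-ary)

def setup_dining_table(num_diners, utensils_per_diner):
    """Stand-in for an induced rule program after variable binding.

    Returns the ground graph as a flat list of relation factors.
    """
    relations = []
    plates = [f"plate_{i}" for i in range(num_diners)]
    for i, plate in enumerate(plates):
        # Rule: each diner's plate sits near the table edge.
        relations.append(Relation("near_table_edge", (plate,)))
        # Rule: each utensil goes on its designated side of the plate.
        for utensil, side in utensils_per_diner:
            rel = "left_of" if side == "left" else "right_of"
            relations.append(Relation(rel, (f"{utensil}_{i}", plate)))
    if len(plates) > 1:
        # Multi-ary rule over all plates at once.
        relations.append(Relation("regular_spacing", tuple(plates)))
    return relations

# Binding for an instruction such as "set up a dining table for two".
graph = setup_dining_table(2, [("fork", "left"), ("knife", "right")])
for relation in graph:
    print(relation.name, relation.objects)
```

Each returned `Relation` plays the role of one factor in the ground graph; a pose-grounding stage would then need one model per relation name to turn the factors into object poses.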