What Do You Need for Compositional Generalization in Diffusion Planning?

Quentin Clark, Florian Shkurti

TL;DR

Diffusion planners can stitch sub-trajectories to generate novel behaviors without TD learning by leveraging locality and shift equivariance. The study identifies two core architectural enablers of compositionality, local receptive fields and shift equivariance, and treats inference strategies, data augmentation, and data scaling as supporting factors. The authors introduce Eq-Net, a simple CNN-based denoiser with small receptive fields that produces diverse, goal-conditioned trajectories and is competitive with more compute-intensive approaches. Across navigation and manipulation tasks, Eq-Net demonstrates strong compositional generalization, offering practical guidance for designing diffusion planners that stitch by design and enabling more steerable, goal-directed planning in robotics.

Abstract

In policy learning, stitching and compositional generalization refer to the extent to which the policy is able to piece together sub-trajectories of data it is trained on to generate new and diverse behaviours. While stitching has been identified as a significant strength of offline reinforcement learning, recent generative behavioural cloning (BC) methods have also shown proficiency at stitching. However, the main factors behind this are poorly understood, hindering the development of new algorithms that can reliably stitch by design. Focusing on diffusion planners trained via generative behavioural cloning, and without resorting to dynamic programming or TD-learning, we find that three properties are key enablers for composition: shift equivariance, local receptive fields, and inference choices. We use these properties to explain architecture, data, and inference choices in existing generative BC methods based on diffusion planning, including replanning frequency, data augmentation, and data scaling. Our experiments show that while local receptive fields are more important than shift equivariance in creating a diffusion planner capable of composition, both are crucial. Using findings from our experiments, we develop a new architecture for diffusion planners called Eq-Net that is simple, produces diverse trajectories competitive with more computationally expensive methods such as replanning or scaling data, and can be guided to enable generalization in goal-conditioned settings. We show that Eq-Net exhibits significant compositional generalization in a variety of navigation and manipulation tasks designed to test planning diversity.
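
The two architectural properties named in the abstract can be illustrated with a small numerical sketch. The code below is not Eq-Net and is not taken from the paper; it assumes a hypothetical single-layer filter (`local_conv1d`) applied along the time axis of a trajectory with circular padding, chosen only because it makes both properties exact and easy to check. Shift equivariance means that shifting the input in time shifts the output by the same amount; a local receptive field means that changing one time step only affects nearby outputs.

```python
import numpy as np

def local_conv1d(traj, kernel):
    """Circular 1D filter along the time axis (illustrative, not Eq-Net).

    traj:   (T, D) array, a trajectory of T steps with D state dimensions.
    kernel: (K,) array with odd K; each output step sees K neighbouring steps.
    """
    radius = len(kernel) // 2
    out = np.zeros_like(traj, dtype=float)
    for offset, weight in zip(range(-radius, radius + 1), kernel):
        # np.roll(traj, -offset)[t] equals traj[(t + offset) % T]
        out += weight * np.roll(traj, -offset, axis=0)
    return out

rng = np.random.default_rng(0)
traj = rng.normal(size=(16, 2))      # hypothetical 16-step, 2-D trajectory
kernel = np.array([0.25, 0.5, 0.25])  # receptive field of 3 time steps

# Shift equivariance: filtering a shifted trajectory equals shifting the
# filtered trajectory.
shift = 5
shifted_then_filtered = local_conv1d(np.roll(traj, shift, axis=0), kernel)
filtered_then_shifted = np.roll(local_conv1d(traj, kernel), shift, axis=0)
equivariant = np.allclose(shifted_then_filtered, filtered_then_shifted)

# Locality: perturbing step 8 only changes outputs within one step of it.
perturbed = traj.copy()
perturbed[8] += 1.0
diff = ~np.isclose(local_conv1d(perturbed, kernel), local_conv1d(traj, kernel))
changed_steps = np.flatnonzero(diff.any(axis=1))

print(equivariant, changed_steps)
```

With zero padding instead of circular padding, the equivariance check would hold only away from the trajectory boundaries, and stacking more layers would widen the receptive field, so both choices matter when reading this as an analogy for a real denoiser.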

Paper Structure

This paper contains 55 sections, 33 figures, 9 tables.

Figures (33)

  • Figure 1: We provide an analysis that identifies the critical design decisions that enable diffusion planners to exhibit stitching and compositional generalization via generative behaviour cloning, without resorting to dynamic programming or TD learning, as commonly done in offline RL.
  • Figure 2: Common Diffusion Backbones Memorize: We define memorization as new trajectories consisting only of sub-skills previously seen in the same trajectory. High memorization rates for both U-Net and DiT architectures used as diffusion planners in a variety of environments show that composition cannot be taken for granted in diffusion planning. See the appendix for more discussion.
  • Figure 3: The methods to enable trajectory composition that we examine. The left shows replanning (an inference technique), the middle positional augmentation (a dataset augmentation technique) and the right architectural modification. Each of these enables composition, showing that the broad principles of local receptive fields and shift equivariance can be applied in different contexts.
  • Figure 4: The environments we use in our experiments. From top left to bottom right: Maze, Didactic, Well-Plate Real, Lights, Block-Stack, Well-Plate Sim. The Lights picture is taken from OGBench (Park et al., 2024).
  • Figure 5: Architecture choices enable compositionality. Experiments across all environments show that using a local and positionally equivariant architecture substantially increases compositionality, yielding a greater number of novel trajectories.
  • ...and 28 more figures
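
The Figure 2 caption defines memorization as generated trajectories that consist only of sub-skills seen together in a single training trajectory. A minimal sketch of how such a rate could be computed is given below. It rests on assumptions not stated in the source: each trajectory is represented as a tuple of discrete sub-skill labels, and "seen in the same trajectory" is read as set containment. The function names and the toy data are hypothetical and are not the authors' evaluation code.

```python
def is_memorized(generated, training_set):
    """True if every sub-skill in `generated` co-occurs in one training trajectory.

    Assumption: trajectories are tuples of hashable sub-skill labels.
    """
    skills = set(generated)
    return any(skills <= set(train) for train in training_set)

def memorization_rate(samples, training_set):
    """Fraction of generated samples that only reuse a training combination."""
    return sum(is_memorized(s, training_set) for s in samples) / len(samples)

# Toy data: training only ever pairs A with B and C with D.
training_set = [("A", "B"), ("C", "D")]
samples = [("A", "B"), ("A", "D"), ("C", "D"), ("C", "B")]

rate = memorization_rate(samples, training_set)
print(rate)  # 0.5: two samples repeat training pairs, two are novel combinations
```

Under this reading, a planner that stitches produces samples such as `("A", "D")`, whose sub-skills each appear in the data but never in the same training trajectory, which lowers the memorization rate.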