What Do You Need for Compositional Generalization in Diffusion Planning?
Quentin Clark, Florian Shkurti
TL;DR
Diffusion planners can stitch sub-trajectories into novel behaviors without TD-learning by relying on locality and position-invariant reasoning. The study identifies local receptive fields and shift equivariance as the two core architectural enablers of compositionality, with inference choices as a third, and uses these properties to explain the effects of replanning frequency, data augmentation, and data scaling. The authors introduce Eq-Net, a simple CNN-based denoiser with small receptive fields that produces diverse trajectories, can be guided in goal-conditioned settings, and is competitive with more computationally expensive approaches such as replanning or scaling data. Across navigation and manipulation tasks, Eq-Net shows significant compositional generalization, offering practical guidance for designing diffusion planners that stitch by design.
Abstract
In policy learning, stitching and compositional generalization refer to the extent to which the policy is able to piece together sub-trajectories of data it is trained on to generate new and diverse behaviours. While stitching has been identified as a significant strength of offline reinforcement learning, recent generative behavioural cloning (BC) methods have also shown proficiency at stitching. However, the main factors behind this are poorly understood, hindering the development of new algorithms that can reliably stitch by design. Focusing on diffusion planners trained via generative behavioural cloning, and without resorting to dynamic programming or TD-learning, we find that three properties are key enablers for composition: shift equivariance, local receptive fields, and inference choices. We use these properties to explain architecture, data, and inference choices in existing generative BC methods based on diffusion planning, including replanning frequency, data augmentation, and data scaling. Our experiments show that while local receptive fields are more important than shift equivariance in creating a diffusion planner capable of composition, both are crucial. Using findings from our experiments, we develop a new architecture for diffusion planners, called Eq-Net, that is simple, produces diverse trajectories competitive with more computationally expensive methods such as replanning or scaling data, and can be guided to enable generalization in goal-conditioned settings. We show that Eq-Net exhibits significant compositional generalization in a variety of navigation and manipulation tasks designed to test planning diversity.
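The two architectural properties named in the abstract, shift equivariance and local receptive fields, can be made concrete with a toy example. The sketch below is not Eq-Net or the authors' code: it is a minimal, hypothetical stack of three small-kernel 1-D convolutions over a scalar trajectory, with circular padding assumed so that equivariance holds exactly. The names `circular_conv1d`, `tiny_denoiser`, `shift`, and `receptive_field` are illustrative only.

```python
import random


def circular_conv1d(x, kernel):
    """Slide an odd-length kernel along the time axis with wrap-around padding.

    As in most deep-learning conv layers, this is a cross-correlation.
    """
    n, half = len(x), len(kernel) // 2
    return [
        sum(w * x[(t + j - half) % n] for j, w in enumerate(kernel))
        for t in range(n)
    ]


def tiny_denoiser(x, kernels):
    """Hypothetical stand-in for a denoiser: conv layers with ReLU in between."""
    h = list(x)
    for i, kernel in enumerate(kernels):
        h = circular_conv1d(h, kernel)
        if i < len(kernels) - 1:
            h = [max(0.0, v) for v in h]  # pointwise, so it preserves equivariance
    return h


def shift(x, s):
    """Circularly shift a sequence s steps to the right."""
    s %= len(x)
    return list(x[-s:]) + list(x[:-s]) if s else list(x)


def receptive_field(kernels):
    """Number of input timesteps that can influence one output timestep."""
    return 1 + sum(len(k) - 1 for k in kernels)


rng = random.Random(0)
kernels = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
traj = [rng.uniform(-1, 1) for _ in range(32)]

# Shift equivariance: shifting the input shifts the output by the same amount.
out_then_shift = shift(tiny_denoiser(traj, kernels), 5)
shift_then_out = tiny_denoiser(shift(traj, 5), kernels)
equivariant = all(abs(a - b) < 1e-12 for a, b in zip(out_then_shift, shift_then_out))

# Locality: perturbing one timestep only changes outputs inside the receptive field.
perturbed = list(traj)
perturbed[16] += 1.0
base = tiny_denoiser(traj, kernels)
changed = tiny_denoiser(perturbed, kernels)
radius = receptive_field(kernels) // 2
affected = [t for t in range(len(traj)) if base[t] != changed[t]]
local = all(abs(t - 16) <= radius for t in affected)

print("receptive field:", receptive_field(kernels))
print("shift equivariant:", equivariant)
print("perturbation stays local:", local)
```

Under these assumptions, each output timestep depends only on a short window of the input and is processed identically wherever that window sits in the trajectory, which is the intuition for why such a denoiser could recombine locally plausible segments seen at different positions in the training data.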
