Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks
Edwin Zhang, Yujie Lu, Shinda Huang, William Wang, Amy Zhang
TL;DR
LCD addresses the challenge of scaling generalist agents by unifying language-conditioned instruction with long-horizon planning via hierarchical diffusion. It introduces a high-level diffusion policy conditioned on language and uses a frozen low-level policy encoder to execute plans, enabling efficient planning in a latent space with DDIM and temporal abstraction. Theoretical near-optimality guarantees are provided under mild Lipschitz assumptions, and empirical results on CALVIN and CLEVR-Robot show state-of-the-art performance and 3.3x–15x inference speedups. This work advances practical long-horizon, language-guided control, offering a scalable path toward generalist agents capable of handling space, time, and task diversity.
Abstract
Training generalist agents is difficult across several axes: it requires handling high-dimensional inputs (space), long horizons (time), and generalization to novel tasks. Recent architectural advances have improved scaling along one or two of these axes, but remain computationally prohibitive to use. In this paper, we propose to address all three axes by leveraging Language to Control Diffusion models as a hierarchical planner conditioned on language (LCD). We effectively and efficiently scale diffusion models for planning across extended temporal, state, and task dimensions to tackle long-horizon control problems conditioned on natural language instructions, as a step towards generalist agents. On the CALVIN language robotics benchmark, LCD outperforms other state-of-the-art methods in multi-task success rate while improving inference speed over comparable diffusion models by 3.3x to 15x. We show that LCD can successfully leverage the unique strength of diffusion models, producing coherent long-range plans, while addressing their weakness in generating low-level details and control.
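To make the hierarchy concrete, the following is a minimal sketch of how a language-conditioned high-level planner and a low-level policy could interact. It is not the authors' implementation. It assumes a deterministic DDIM sampler (eta = 0) over a short strided schedule, and it simplifies the planner to emit a single latent goal every `c` environment steps. All names (`ddim_sample`, `hierarchical_rollout`, `eps_model`, `encode`, `low_level_policy`, `env_step`, `c`) are hypothetical placeholders introduced for illustration.

```python
import math
import random


def ddim_sample(eps_model, cond, dim, alphas_bar, rng):
    """Deterministic DDIM (eta = 0) sampling of one latent vector.

    alphas_bar holds the cumulative schedule products for the retained
    (strided) steps, ordered from most noisy (small) to least noisy (near 1).
    eps_model(x, a_t, cond) is a hypothetical noise predictor.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i, a_t in enumerate(alphas_bar):
        a_prev = alphas_bar[i + 1] if i + 1 < len(alphas_bar) else 1.0
        eps = eps_model(x, a_t, cond)
        # Clean latent implied by the current noise estimate.
        x0 = [(xj - math.sqrt(1.0 - a_t) * ej) / math.sqrt(a_t)
              for xj, ej in zip(x, eps)]
        # Deterministic move to the next, less noisy level.
        x = [math.sqrt(a_prev) * x0j + math.sqrt(1.0 - a_prev) * ej
             for x0j, ej in zip(x0, eps)]
    return x


def hierarchical_rollout(state, lang_emb, eps_model, encode, low_level_policy,
                         env_step, horizon, c, alphas_bar, rng):
    """Replan a latent goal every c steps; act with the low-level policy between."""
    goal, replans = None, 0
    for t in range(horizon):
        if t % c == 0:  # temporal abstraction: the planner runs at 1/c frequency
            z = encode(state)  # assumed frozen encoder defining the latent space
            goal = ddim_sample(eps_model, (z, lang_emb), len(z), alphas_bar, rng)
            replans += 1
        action = low_level_policy(encode(state), goal)
        state = env_step(state, action)
    return state, replans
```

Under these assumptions, the planner is invoked only `ceil(horizon / c)` times, and each invocation runs `len(alphas_bar)` denoising steps in the latent space rather than in the raw observation space. Both factors reduce inference cost in this sketch. The speedups reported in the abstract are properties of the authors' system, not of this toy code.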
