Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks

Edwin Zhang; Yujie Lu; Shinda Huang; William Wang; Amy Zhang

TL;DR

LCD addresses the challenge of scaling generalist agents by unifying language-conditioned instruction with long-horizon planning via hierarchical diffusion. It introduces a high-level diffusion policy conditioned on language and uses a frozen low-level policy encoder to execute plans, enabling efficient planning in a latent space with DDIM and temporal abstraction. Theoretical near-optimality guarantees are provided under mild Lipschitz assumptions, and empirical results on CALVIN and CLEVR-Robot show state-of-the-art performance and 3.3x–15x inference speedups. This work advances practical long-horizon, language-guided control, offering a scalable path toward generalist agents capable of handling space, time, and task diversity.
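The division of labor between the two policies can be illustrated with a minimal sketch. It is not the authors' implementation: `plan_fn`, `act_fn`, `step_fn`, and `steps_per_goal` are hypothetical names, the toy functions stand in for the diffusion planner, the goal-conditioned low-level policy, and the environment, and states and goals are assumed to live in the same 1-D space purely for illustration.

```python
from typing import Callable, List

Vector = List[float]


def hierarchical_rollout(
    plan_fn: Callable[[Vector, str], List[Vector]],  # hypothetical high-level planner
    act_fn: Callable[[Vector, Vector], Vector],      # hypothetical goal-conditioned low-level policy
    step_fn: Callable[[Vector, Vector], Vector],     # hypothetical environment transition
    state: Vector,
    instruction: str,
    steps_per_goal: int,
) -> List[Vector]:
    """Plan once from the instruction, then pursue each goal for several low-level steps."""
    trajectory = [state]
    for goal in plan_fn(state, instruction):
        for _ in range(steps_per_goal):
            action = act_fn(state, goal)
            state = step_fn(state, action)
            trajectory.append(state)
    return trajectory


def toy_plan(state: Vector, instruction: str) -> List[Vector]:
    # Stand-in for the language-conditioned planner: two fixed 1-D goals.
    return [[1.0], [2.0]]


def toy_act(state: Vector, goal: Vector) -> Vector:
    # Move halfway toward the current goal.
    return [0.5 * (g - s) for s, g in zip(state, goal)]


def toy_step(state: Vector, action: Vector) -> Vector:
    # Additive toy dynamics.
    return [s + a for s, a in zip(state, action)]


trajectory = hierarchical_rollout(
    toy_plan, toy_act, toy_step, [0.0], "open the drawer", steps_per_goal=4
)
```

In this toy setup the planner is queried once for two goals while the low-level policy takes eight steps, which is the sense in which temporal abstraction reduces how often the expensive planner has to run.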

Abstract

Training generalist agents is difficult across several axes, requiring us to deal with high-dimensional inputs (space), long horizons (time), and generalization to novel tasks. Recent advances with architectures have allowed for improved scaling along one or two of these axes, but are still computationally prohibitive to use. In this paper, we propose to address all three axes by leveraging Language to Control Diffusion models as a hierarchical planner conditioned on language (LCD). We effectively and efficiently scale diffusion models for planning in extended temporal, state, and task dimensions to tackle long horizon control problems conditioned on natural language instructions, as a step towards generalist agents. Comparing LCD with other state-of-the-art models on the CALVIN language robotics benchmark finds that LCD outperforms other SOTA methods in multi-task success rates, whilst improving inference speed over other comparable diffusion models by 3.3x to 15x. We show that LCD can successfully leverage the unique strength of diffusion models to produce coherent long range plans while addressing their weakness in generating low-level details and control.

Paper Structure (70 sections, 2 theorems, 17 equations, 23 figures, 7 tables, 1 algorithm)

Introduction
Background
Reinforcement Learning.
Language Conditioned RL.
Goal Conditioned Imitation Learning and Hierarchical RL.
The Language Control Diffusion (LCD) Framework
Hierarchical Diffusion Policies
High-level Diffusion Policy Objective.
Near Optimality Guarantees.
Practical Instantiation
High-Level Policy.
Low-Level Policy.
Model Architecture.
Experiments
Experimental Setup
...and 55 more sections

Key Result

Proposition 3.1

If the transition function $p(s'|s, a)$ is Lipschitz continuous with constant $K_f$ and $\sup_{s\in S, a\in A} | \pi_{\mathrm{lo}}(s) - a^* | \le \epsilon$, then

Figures (23)

Figure 1: An overview of our high-level policy training pipeline. The frozen low-level policy encoder is used to encode a latent plan, or a subsampled sequence of RGB observations encoded into a lower dimensional latent space (1), which will be used later on as goals for the goal-conditioned low-level policy (LLP). We then noise this latent plan according to a uniformly sampled timestep from the diffusion process' variance schedule (2), and train a Temporal U-Net conditioned on natural language embeddings from a frozen upstream large language model to reverse the noising process (3), effectively learning how to conditionally denoise the latent plan. To train the U-Net, one can simply use the $p$-norm between the predicted latent plan and the ground truth latent plan as the loss (4). We use $p=1$ in practice following Janner2022.
Figure 2: Denoised Latent Representations. Directly using latent diffusion models fails. Hallucination occurs on a $\beta$-TC VAE trained from scratch on the CALVIN dataset (Diffuser-1D), and loss of fine details occurs with SD v1.4's internet-scale pretrained autoencoder from rombach2022high (Diffuser-2D). For more and enlarged samples, please refer to the appendix on representation failure (appendix:representation-failure).
Figure 3: An overview of our denoising process. We give an example of the denoising process of one of our ablations, the Diffuser-2D model (figure labels fig:lad2d_denoising and fig:diffusion_generation). This model utilizes the 2D autoencoder of rombach2022high with Janner2022.
Figure 4: Diffusion Loss Comparison. Here we study how varying the diffusion model's size changes the performance of the model. As can be seen, scaling the model from 64 hidden dimensions to 128 strictly increases generation quality, and would likely follow scaling laws observed in kaplan2020scaling.
Figure 5: The Evaluation Task Distribution. We visualize the distribution of all the tasks considered in our experiments. Note the long-tailedness of this distribution, and how it skews evaluation scores upwards if one can solve the relatively easier tasks that occur most frequently, such as Open Drawer, Move Slider Right, and Move Slider Left. These tasks only deal with static objects, so very little generalization is needed to solve them compared to other block tasks involving randomized block positions.
...and 18 more figures
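
The four steps in the Figure 1 caption can be sketched as a single loss computation. This is a minimal illustration under stated assumptions, not the paper's code: the linear variance schedule, the `stride` subsampling parameter, and the toy `encode` and `denoise` callables are all hypothetical stand-ins for the frozen low-level policy encoder and the language-conditioned Temporal U-Net. Only the structure (encode, noise at a uniformly sampled timestep, predict the clean latent plan, take a $p=1$ loss) comes from the caption.

```python
import math
import random
from typing import Callable, List

Plan = List[List[float]]  # latent plan: one latent vector per subsampled frame


def linear_alpha_bars(num_steps: int, beta_start: float = 1e-4,
                      beta_end: float = 2e-2) -> List[float]:
    """Cumulative products of (1 - beta_t) for an assumed linear variance schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_plan(plan: Plan, alpha_bar: float, rng: random.Random) -> Plan:
    """Step (2): x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * x + b * rng.gauss(0.0, 1.0) for x in latent] for latent in plan]


def l1_loss(pred: Plan, target: Plan) -> float:
    """Step (4) with p = 1: mean absolute error over all plan entries."""
    diffs = [abs(p - t) for pl, tl in zip(pred, target) for p, t in zip(pl, tl)]
    return sum(diffs) / len(diffs)


def training_loss(
    observations: List[List[float]],
    encode: Callable[[List[float]], List[float]],       # stand-in for the frozen encoder
    denoise: Callable[[Plan, int, List[float]], Plan],  # stand-in for the Temporal U-Net
    text_embedding: List[float],
    alpha_bars: List[float],
    stride: int,
    rng: random.Random,
) -> float:
    # (1) Subsample frames and encode them into a latent plan.
    plan = [encode(obs) for obs in observations[::stride]]
    # (2) Uniformly sample a diffusion timestep and noise the plan.
    t = rng.randrange(len(alpha_bars))
    noisy = noise_plan(plan, alpha_bars[t], rng)
    # (3) Predict the clean plan from the noisy plan, timestep, and text embedding.
    pred = denoise(noisy, t, text_embedding)
    # (4) Compare the prediction with the ground-truth latent plan.
    return l1_loss(pred, plan)


def encode(obs: List[float]) -> List[float]:
    # Toy encoder: average the observation into a 1-D latent.
    return [sum(obs) / len(obs)]


def identity_denoise(noisy: Plan, t: int, text: List[float]) -> Plan:
    # Untrained placeholder that returns its noisy input unchanged.
    return noisy


rng = random.Random(0)
observations = [[float(i), float(i) + 1.0] for i in range(8)]
alpha_bars = linear_alpha_bars(100)
loss = training_loss(observations, encode, identity_denoise, [0.0],
                     alpha_bars, stride=4, rng=rng)
```

The sketch stops at the loss value. In an actual training loop this scalar would be differentiated with respect to the denoiser's parameters, while the encoder stays frozen.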

Theorems & Definitions (3)

Proposition 3.1
proof
Theorem G.1
