Text-Aware Diffusion for Policy Learning

Calvin Luo; Mandy He; Zilai Zeng; Chen Sun

TL;DR

Text-Aware Diffusion for Policy Learning (TADPoLe) uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. It learns policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments.

Abstract

Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations. However, supporting novel goals or behaviors through reinforcement learning requires the ad-hoc design of appropriate reward functions, which quickly becomes intractable. To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in alignment with a notion of naturalness summarized from internet-scale training data. In our experiments, we demonstrate that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments. The behaviors are learned zero-shot without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation. We further show that TADPoLe performs competitively when applied to robotic manipulation tasks in the Meta-World environment, without having access to any in-domain demonstrations.

Paper Structure (21 sections, 6 equations, 16 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Method
Text-Aware Diffusion for Policy Learning
TADPoLe with Text-to-Video Diffusion Models
Experiments
Experimental Setup and Evaluation
Goal Achievement
Continuous Locomotion
Robotic Manipulation
Normalization Study
Conclusion and Future Work
Intuition Regarding A Reasonable Noise Level Range
Noise Level for Video-TADPoLe
Detailed Hyperparameters
...and 6 more sections

Figures (16)

Figure 1: Our proposed Text-Aware Diffusion for Policy Learning (TADPoLe) framework leverages frozen, pretrained text-aware diffusion models to automatically craft dense text-conditioned rewards for policy learning. Here we visualize TADPoLe achieving diverse text-conditioned goals in the Humanoid, Dog, and Meta-World environments.
Figure 2: A policy $\pi_{\theta}$ that interacts with an environment can be treated as an agent-centric implicit video representation, where the arrow of time is actuated by the agent's actions and the pixels are rendered by the environment. The rendered behaviors can then be evaluated by a text-aware diffusion model to produce dense rewards, thereby providing text-conditioned update signals to the policy.
Figure 3: An illustration of the TADPoLe pipeline, which computes text-conditioned rewards for policy learning through a pretrained, frozen diffusion model. At each timestep, the subsequent frame rendered through the environment is corrupted with a sampled Gaussian source noise vector $\mathbf{\epsilon}_0$. The pretrained text-conditioned diffusion model then predicts the source noise that was added. The reward is designed to be large when the selected action produces frames well-aligned with the text prompt.
Figure 5: Episode return curves for a Humanoid agent trained with Video-TADPoLe, using the prompt "a person walking". We observe that the Video-TADPoLe reward signal (left) is positively correlated with the agent's performance as measured with ground-truth reward during training (middle) and evaluation (right). Shaded regions denote the standard deviation across five random seeds.
Figure A1: Noise range intuition for a fixed image but two distinct prompts (left), and for a fixed prompt but two distinct images (right). Through visualization, we verify that $U(400, 500)$ is a reasonable range from which to sample noise levels that can meaningfully distinguish vision-text alignment for arbitrarily rendered frames.
...and 11 more figures
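The pipeline described in the Figure 3 caption (corrupt the rendered frame with sampled Gaussian noise, have the diffusion model predict that noise, and derive a reward from the prediction) can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the linear noise schedule, the `toy_predict_noise` stand-in for the frozen diffusion model, the "template" image standing in for a text prompt, and the specific reward (how much conditioning reduces the noise-prediction error) are all assumptions introduced here. Only the noise-level range $U(400, 500)$ comes from the Figure A1 caption, and treating it as a half-open integer range is also an assumption.

```python
import numpy as np

# Assumption: a DDPM-style linear beta schedule with 1000 training steps.
NUM_TRAIN_STEPS = 1000
BETAS = np.linspace(1e-4, 0.02, NUM_TRAIN_STEPS)
ALPHA_BARS = np.cumprod(1.0 - BETAS)


def add_noise(frame, t, rng):
    """Corrupt a rendered frame with Gaussian source noise at noise level t."""
    eps = rng.standard_normal(frame.shape)
    a = ALPHA_BARS[t]
    noisy = np.sqrt(a) * frame + np.sqrt(1.0 - a) * eps
    return noisy, eps


def toy_predict_noise(noisy, t, template):
    """Hypothetical stand-in for the frozen diffusion model.

    It assumes the clean frame equals `template` and solves the forward
    process for the noise. A real model would take a text prompt instead.
    """
    a = ALPHA_BARS[t]
    return (noisy - np.sqrt(a) * template) / np.sqrt(1.0 - a)


def sketch_reward(frame, prompt_template, rng, t_low=400, t_high=500):
    """Illustrative reward: larger when the frame matches the 'prompt'."""
    # Sample an integer noise level from [t_low, t_high).
    t = int(rng.integers(t_low, t_high))
    noisy, eps = add_noise(frame, t, rng)
    # "Text-conditioned" prediction vs. an "unconditional" one (zero template).
    eps_cond = toy_predict_noise(noisy, t, prompt_template)
    eps_uncond = toy_predict_noise(noisy, t, np.zeros_like(frame))
    err_cond = np.mean((eps_cond - eps) ** 2)
    err_uncond = np.mean((eps_uncond - eps) ** 2)
    # Assumed reward: how much conditioning improves the noise prediction.
    return float(err_uncond - err_cond)


rng = np.random.default_rng(0)
prompt_template = np.ones((8, 8, 3))   # toy "what the prompt describes"
aligned_frame = prompt_template.copy()  # frame that matches the prompt
misaligned_frame = -prompt_template     # frame that does not
r_aligned = sketch_reward(aligned_frame, prompt_template, rng)
r_misaligned = sketch_reward(misaligned_frame, prompt_template, rng)
```

In this toy setup the aligned frame receives a positive reward and the misaligned frame a negative one, which mirrors the caption's statement that the reward should be large when the selected action produces frames well-aligned with the text prompt. In an actual policy-learning loop, such a per-frame value would be passed to the reinforcement learning algorithm in place of a hand-designed reward.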
