Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Zechu Li, Rickmer Krohn, Tao Chen, Anurag Ajay, Pulkit Agrawal, Georgia Chalvatzaki

TL;DR

This work introduces Deep Diffusion Policy Gradient (DDiffPG), an actor-critic framework that learns multimodal policies from scratch by parameterizing policies as diffusion models. It explicitly discovers, preserves, and controls multiple behavioral modes through unsupervised trajectory clustering, mode-specific Q-functions, and mode embeddings, and it mitigates the greediness of the RL objective with a multimodal training batch. A diffusion policy gradient is derived from a target action $a^{target} = a + \eta \nabla_{a} Q(s,a)$, obtained by moving an action $a$ along the action-gradient of the Q-function with step size $\eta$. The policy is then fit to these targets with a behavior-cloning (BC) style loss, which enables stable online RL updates. Empirical results on high-dimensional, sparse-reward tasks (AntMaze and robotic manipulation) show that DDiffPG masters multimodal behaviors, promotes exploration to escape local minima, and enables online replanning, albeit at a higher computational cost than some baselines. This approach offers a principled path to versatile, controllable policies in nonstationary and long-horizon scenarios.
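The target-action update in the TL;DR can be illustrated with a small NumPy sketch. Everything beyond the formula $a^{target} = a + \eta \nabla_{a} Q(s,a)$ is an assumption made for illustration: the quadratic critic `toy_q`, its analytic gradient, the noise schedule, and the DDPM-style noise-prediction form of the BC loss are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def toy_q(state, action):
    """Hypothetical critic: Q(s, a) = -||a - s||^2, maximised at a = s."""
    return -np.sum((action - state) ** 2, axis=-1)

def toy_q_grad_action(state, action):
    """Analytic gradient of toy_q with respect to the action."""
    return -2.0 * (action - state)

def target_action(state, action, eta):
    """a_target = a + eta * grad_a Q(s, a), the formula from the TL;DR."""
    return action + eta * toy_q_grad_action(state, action)

def bc_denoising_loss(eps_model, state, a_target, alpha_bar, rng):
    """Assumed BC-style loss: treat a_target as the clean sample, add noise
    at a random diffusion step, and regress the noise (DDPM-style)."""
    k = rng.integers(0, len(alpha_bar), size=a_target.shape[0])
    ab = alpha_bar[k][:, None]
    eps = rng.standard_normal(a_target.shape)
    noisy = np.sqrt(ab) * a_target + np.sqrt(1.0 - ab) * eps
    pred = eps_model(noisy, state, k)
    return float(np.mean((pred - eps) ** 2))

rng = np.random.default_rng(0)
states = rng.standard_normal((4, 2))
actions = rng.standard_normal((4, 2))

# One gradient-ascent step on the action, per the target-action formula.
targets = target_action(states, actions, eta=0.1)

# Hypothetical linear noise schedule and a placeholder model that predicts zero noise.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 10))
zero_model = lambda noisy, s, k: np.zeros_like(noisy)
loss = bc_denoising_loss(zero_model, states, targets, alpha_bar, rng)
```

In this toy setting each target action has a Q-value at least as high as the original action, which is the property the supervised loss relies on: the policy is regressed toward improved actions rather than differentiated through its own likelihood.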

Abstract

Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic one modeled as a Gaussian distribution, hence restricting learning to a single behavioral mode. Meanwhile, diffusion models have emerged as a powerful framework for multimodal learning. However, the use of diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as the greedy objective of RL methods that can easily skew the policy to a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns multimodal policies, parameterized as diffusion models, from scratch while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring the improvement of the diffusion policy across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes. Empirical studies validate DDiffPG's capability to master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, and also showcase proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles.

Paper Structure

This paper contains 21 sections, 1 equation, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: We design (Top) four AntMaze tasks: AntMaze-v1, AntMaze-v2, AntMaze-v3, and AntMaze-v4; and (Bottom) four robotic tasks: Reach, Peg-in-hole, Drawer-close, and Cabinet-open. All of them have a high degree of multimodality.
  • Figure 2: Overview of DDiffPG: (1) the agent interacts with the environment and collects a set of trajectories $\{\tau_i\}$. (2) Given a set of goal-reached trajectories, a DTW distance matrix is computed and used for hierarchical clustering to discover modes. (3) Each mode is associated with a set of trajectories, which is used exclusively to train mode-specific Q-functions and an exploration-specific $Q_{\texttt{explore}}$. (4) A multimodal batch is constructed by concatenating $(s, a^{target})$ pairs sampled from every mode and used for the diffusion policy update.
  • Figure 3: Hierarchical clustering on AntMaze-v3 (Left) and AntMaze-v4 (Right). Each color represents a mode.
  • Figure 4: Performance of DDiffPG and baseline methods in the four AntMaze environments.
  • Figure 5: Exploration maps of DDiffPG and baselines in AntMaze-v3.
  • ...and 7 more figures
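
Step (2) of the Figure 2 overview, computing a DTW distance matrix over goal-reached trajectories and clustering it hierarchically, can be sketched as follows. This is a minimal illustration under stated assumptions: the two-route toy trajectories, the helper names (`dtw_distance`, `dtw_matrix`, `agglomerative_modes`, `route`), the average-linkage rule, and the merge threshold are all hypothetical choices and are not taken from the paper. The hand-written agglomerative loop stands in for an off-the-shelf routine to keep the example dependency-light.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between trajectories of shape (T, d)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def dtw_matrix(trajs):
    """Symmetric matrix of pairwise DTW distances."""
    k = len(trajs)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(trajs[i], trajs[j])
    return dist

def agglomerative_modes(dist, threshold):
    """Average-linkage agglomerative clustering: repeatedly merge the closest
    pair of clusters until the closest pair is farther apart than threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for mode, members in enumerate(clusters):
        labels[members] = mode
    return labels

def route(corner, steps, noise, rng):
    """Toy 2-D path from (0, 0) to (1, 1) that passes through the given corner."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    first = t * corner
    second = (1 - t) * corner + t * np.ones(2)
    return np.vstack([first, second]) + noise * rng.standard_normal((2 * steps, 2))

rng = np.random.default_rng(0)
# Three trajectories per route, with different lengths so that DTW matters.
trajs = [route(np.array([0.0, 1.0]), s, 0.01, rng) for s in (8, 10, 12)]
trajs += [route(np.array([1.0, 0.0]), s, 0.01, rng) for s in (8, 10, 12)]

dist = dtw_matrix(trajs)
labels = agglomerative_modes(dist, threshold=4.0)  # threshold chosen for this toy data
```

Each resulting label plays the role of a mode index. In the overview, the trajectories assigned to a mode are then used to train that mode's Q-function, and $(s, a^{target})$ pairs sampled from every mode are concatenated into the multimodal batch.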