D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C Stadie

TL;DR

D2AC introduces a model-free reinforcement learning algorithm that pairs a diffusion-based actor with a distributional critic, guided by a stable one-step policy-improvement objective. The distributional critic uses clipped double Q-learning over a categorical return distribution, which stabilizes learning and provides rich value information to the actor. A key theoretical contribution is a one-step lower-bound simplification that enables efficient policy updates without backpropagation through time, bridging diffusion dynamics with conventional value-based updates. Empirically, D2AC achieves state-of-the-art performance across dense and sparse reward tasks, including complex robotic control and a biology-inspired predator–prey benchmark, while maintaining favorable wall-clock efficiency relative to model-based methods. The work suggests that combining distributional value estimation with diffusion-based action proposals can closely approach planning performance in a purely model-free setting, with meaningful implications for exploration and generalization in challenging domains.

Abstract

We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.
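As a rough illustration of the critic described above, the sketch below computes the expected value (the Q-value) of a categorical return distribution and combines two critics in the spirit of clipped double Q-learning. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the five-atom support on [-10, 10], and the rule of keeping the distribution with the smaller expected value are all hypothetical choices made for illustration.

```python
import numpy as np

def expected_q(probs, support):
    """Expected return of a categorical distribution: sum_i p_i * z_i."""
    return np.sum(probs * support, axis=-1)

def clipped_double_select(probs_a, probs_b, support):
    """Keep, per sample, the distribution whose expected value is smaller.

    Assumption: this is one common way to combine two distributional
    critics pessimistically; the paper's exact rule may differ.
    """
    q_a = expected_q(probs_a, support)
    q_b = expected_q(probs_b, support)
    pick_a = (q_a <= q_b)[..., None]  # broadcast the choice over atoms
    return np.where(pick_a, probs_a, probs_b)

# Hypothetical fixed support of return atoms: -10, -5, 0, 5, 10.
support = np.linspace(-10.0, 10.0, 5)

# Two critics' predicted probabilities for a batch of one (state, action).
probs_a = np.array([[0.1, 0.2, 0.4, 0.2, 0.1]])  # symmetric, mean 0
probs_b = np.array([[0.0, 0.1, 0.2, 0.3, 0.4]])  # skewed high, mean 5

chosen = clipped_double_select(probs_a, probs_b, support)
q_value = expected_q(chosen, support)  # scalar value an actor could use
```

Here the first critic is the more pessimistic one, so its distribution is kept and its expected value serves as the Q-estimate.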

Paper Structure

This paper contains 39 sections, 66 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: D2 Actor Critic uses a critic that models a distribution over possible returns. A diffusion actor uses the expected value of this distribution (the Q-function) to help align the denoising process with policy improvement. Above are visualizations of the Pick-and-Place and Fetch Slide environments.
  • Figure 2: D2AC works out of the box across a wide range of environments, including locomotion and manipulation with sparse and dense rewards.
  • Figure 3: Experiments on DeepMind Control Suite. Results over $5$ seeds. Among model-free RL methods, D2AC achieves much better sample efficiency and asymptotic performance than all other baselines.
  • Figure 4: Comparison between model-based TD-MPC2, SAC, and our method D2AC on DeepMind Control Suite. D2AC without planning can achieve results on par with TD-MPC2.
  • Figure 5: Experiments on Multi-Goal RL environments with sparse rewards. Results over $5$ seeds.
  • ...and 5 more figures

Theorems & Definitions (1)

  • proof