Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

Tianle Zhang, Jiayi Guan, Lin Zhao, Yihang Li, Dongjiang Li, Zecui Zeng, Lei Sun, Yue Chen, Xuelong Wei, Lusong Li, Xiaodong He

TL;DR

This work tackles offline reinforcement learning with diffusion-based policies, arguing that weighted regression (WR) limits policy improvement because it relies only on the collected actions and is sensitive to noise in the Q-values. It introduces PAO-DP, a diffusion-policy framework that automatically generates preferred actions using a critic and applies anti-noise preference optimization, based on a Bradley-Terry model, to improve the policy without WR. PAO-DP achieves competitive or superior performance on the D4RL suite, particularly in sparse-reward tasks such as Kitchen and AntMaze, and demonstrates robustness to Q-value estimation noise. The approach highlights the value of preference-based action selection for stabilizing and enhancing diffusion-based offline RL, with potential extensions to trajectory-level preferences.
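As a rough illustration of preference optimization under a Bradley-Terry model, the sketch below computes the probability that one action is preferred over another from two scalar scores, plus a negative log-likelihood with a label-smoothing term, which is one generic way to tolerate noisy preference labels. The function names, the `beta` scale, and the `eps` smoothing parameter are assumptions made for illustration; this is not claimed to be the paper's exact anti-noise objective.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_preference_prob(score_w, score_l, beta=1.0):
    """Bradley-Terry probability that the first item is preferred:
    sigmoid(beta * (score_w - score_l))."""
    return math.exp(log_sigmoid(beta * (score_w - score_l)))


def smoothed_bt_loss(score_w, score_l, beta=1.0, eps=0.0):
    """Negative log-likelihood of the preference 'w over l'.

    eps (hypothetical parameter) is the assumed probability that the
    preference label is flipped; eps=0 recovers the plain Bradley-Terry loss.
    """
    z = beta * (score_w - score_l)
    return -(1.0 - eps) * log_sigmoid(z) - eps * log_sigmoid(-z)
```

With `eps > 0`, the loss no longer rewards pushing the score margin to infinity, so a single mislabeled pair has a bounded influence on training. That is the general intuition behind noise-tolerant preference losses of this kind.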

Abstract

Offline reinforcement learning (RL) aims to learn optimal policies from previously collected datasets. Recently, owing to their powerful representational capabilities, diffusion models have shown significant potential as policy models for offline RL problems. However, previous offline RL algorithms based on diffusion policies generally adopt weighted regression to improve the policy. This approach optimizes the policy using only the collected actions and is sensitive to Q-values, which limits the potential for further performance gains. To this end, we propose a novel preferred-action-optimized diffusion policy for offline RL. In particular, an expressive conditional diffusion model is used to represent the diverse distribution of a behavior policy. Based on this diffusion model, preferred actions within the same behavior distribution are automatically generated through the critic function. Moreover, an anti-noise preference optimization is designed to achieve policy improvement using the preferred actions; it can adapt to noisy preferred actions for stable training. Extensive experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods, particularly in sparse-reward tasks such as Kitchen and AntMaze. Additionally, we empirically demonstrate the effectiveness of anti-noise preference optimization.
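The idea of generating preferred actions through the critic can be sketched as follows: draw several candidate actions for a state from a behavior-policy sampler, score each with a Q-function, and treat the highest-scoring candidate as preferred. Everything here is a hypothetical stand-in: `sample_action` takes the place of the conditional diffusion model, `q_value` takes the place of the learned critic, and pairing the best candidate with the worst one is an assumption for illustration, not necessarily the paper's pairing rule.

```python
def select_preference_pair(state, sample_action, q_value, num_candidates=8):
    """Return (preferred, dispreferred) actions for one state.

    sample_action(state) -> action   # hypothetical behavior-policy sampler
    q_value(state, action) -> float  # hypothetical critic
    """
    if num_candidates < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Candidates come from the behavior policy, so they stay in-distribution.
    candidates = [sample_action(state) for _ in range(num_candidates)]
    # Rank candidates by critic value, lowest first.
    ranked = sorted(candidates, key=lambda a: q_value(state, a))
    return ranked[-1], ranked[0]


def toy_q_value(state, action):
    """Toy critic for 1-D actions: value peaks at action == 0.5."""
    return -(action - 0.5) ** 2
```

Because the critic is only used to rank candidates, moderate errors in the Q-values change the outcome only when they reorder the candidates. A preference loss that tolerates noisy labels is useful for exactly those cases.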

Paper Structure (23 sections, 25 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: RAT curves of the ablation methods on representative tasks. The ordinate is the average normalized score.
  • Figure 2: RAT curves for different methods with different $\xi$ on the Kitchen-complete task. The ordinate is the average normalized score.
  • Figure 3: Overall design of the proposed method.
  • Figure 4: Performance of PAO-DP variants in Kitchen-complete and AntMaze-large-play environments, comparing different sampling strategies with $\lambda = 0.0$ and $\lambda = 0.4$.
  • Figure 5: Impact of varying $\lambda$ values on PAO-DP performance in Kitchen-complete and AntMaze-large-play environments.