CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning

Zeyuan Liu, Kai Yang, Xiu Li

TL;DR

CDSA introduces a plug-and-play conservatism mechanism for offline RL by learning gradient fields of the dataset density via denoising score matching and using them to generate auxiliary action components. The method decouples conservatism from policy training and employs an inverse dynamics model to obtain state-gradient guidance, enabling action corrections that steer trajectories toward high-density, safer regions without retraining the baseline offline policy. Empirical results on D4RL MuJoCo and AntMaze tasks show CDSA improves final performance and significantly enhances risk-averse measures such as VaR, with strong generalization across tasks. This approach provides a scalable, theory-grounded way to mitigate distribution shift in offline RL and can be integrated with a wide range of baselines.
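The action-correction idea can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it assumes the dataset density over actions is an isotropic Gaussian, so its score (the gradient of the log-density) is known in closed form, whereas CDSA learns the score with a denoising score-based model. The names `gaussian_score`, `correct_action`, and `step_size` are hypothetical, and the state-gradient component obtained through the inverse dynamics model is omitted.

```python
def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of an isotropic Gaussian at x.

    Stand-in for a learned score model; assumed here for illustration only.
    """
    return [-(xi - mi) / (std ** 2) for xi, mi in zip(x, mean)]


def correct_action(action, score, step_size):
    """Add an auxiliary component that moves the action along the score."""
    return [a + step_size * g for a, g in zip(action, score)]


def squared_distance(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Hypothetical numbers: the data is centred at the origin and the
# pre-trained policy proposes an action away from that high-density region.
data_mean = [0.0, 0.0]
policy_action = [1.0, -2.0]

score = gaussian_score(policy_action, data_mean, std=1.0)
corrected_action = correct_action(policy_action, score, step_size=0.1)

# The corrected action lies closer to the high-density region than the original.
print(squared_distance(corrected_action, data_mean)
      < squared_distance(policy_action, data_mean))
```

The baseline policy is untouched in this sketch; only its output is shifted, which is the sense in which the correction is plug-and-play.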

Abstract

Distribution shift is a major obstacle in offline reinforcement learning, which necessitates minimizing the discrepancy between the learned policy and the behavior policy to avoid overestimating rare or unseen actions. Previous conservative offline RL algorithms struggle to generalize to unseen actions, despite their success in learning good in-distribution policies. In contrast, we propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions. We decouple the conservatism constraints from the policy, so the approach can benefit a wide range of offline RL algorithms. Accordingly, we propose the Conservative Denoising Score-based Algorithm (CDSA), which uses a denoising score-based model to learn the gradient of the dataset density rather than the density itself, and which provides a more accurate and efficient way to adjust the actions generated by the pre-trained policy in deterministic, continuous MDP environments. In experiments, we show that our approach significantly improves the performance of baseline algorithms on D4RL datasets, and we demonstrate the generalizability and plug-and-play capability of our model with different pre-trained offline RL policies across different tasks. We also validate that the agent exhibits greater risk aversion after employing our method, while generalizing effectively across diverse tasks.
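To make "modeling the gradient of the dataset density rather than the density itself" concrete, the following is a minimal sketch of generic denoising score matching on one-dimensional toy data. It is not the paper's loss or architecture: the linear score model, the single noise level `noise_std`, and the closed-form least-squares fit are all assumptions chosen so the example stays short. Each sample $x$ is perturbed to $\tilde{x} = x + \sigma z$ with $z \sim N(0, 1)$, and the model regresses onto the target $-z/\sigma$, which is the score of the Gaussian perturbation kernel.

```python
import random


def fit_linear_score(data, noise_std, seed=0):
    """Fit s(x) = w * x + b by denoising score matching (least squares).

    Minimising E[(s(x + noise_std * z) + z / noise_std) ** 2] with a linear
    model is an ordinary least-squares problem, solved here in closed form.
    """
    rng = random.Random(seed)
    perturbed, targets = [], []
    for x in data:
        z = rng.gauss(0.0, 1.0)
        perturbed.append(x + noise_std * z)  # noisy sample
        targets.append(-z / noise_std)       # score of the perturbation kernel
    n = len(perturbed)
    mean_x = sum(perturbed) / n
    mean_t = sum(targets) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(perturbed, targets)) / n
    var = sum((x - mean_x) ** 2 for x in perturbed) / n
    w = cov / var
    b = mean_t - w * mean_x
    return w, b


# Toy dataset (hypothetical): samples from N(2, 1).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(20000)]

w, b = fit_linear_score(data, noise_std=0.5)

# In theory the fit approaches the score of the noise-perturbed density
# N(2, 1 + 0.25), i.e. w near -0.8 and b near 1.6, so the learned score
# vanishes near the data mean and points toward it elsewhere.
print(w, b, -b / w)
```

The fit never estimates the density or its normalizing constant; it only learns which direction increases density, which is the quantity an action correction needs.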

Paper Structure (22 sections, 1 theorem, 13 equations, 7 figures, 3 tables, 2 algorithms)

Key Result

Lemma 1

The loss $\mathcal{L}_\theta$ in Eq. (4) is equivalent to the following loss: where $\mathbf{z}=(0, z)$ and $z\sim N(0,I)$.

Figures (7)

  • Figure 1: The Conservative Denoising Score-based Algorithm (CDSA) leverages conservatism-related knowledge to enhance the performance of offline RL algorithms. As depicted in (a), the original RL algorithm generates actions based on the current state to interact with the environment. To address the distribution shift problem, we propose to generate auxiliary actions based on the current action-state pair, guiding the entire trajectory towards high-density regions of the training dataset. This is illustrated in (b), where CDSA generates two action components, utilizing conservatism-related knowledge acquired from the training dataset, to be added to the action generated from a pre-trained policy $\pi$.
  • Figure 2: An example comparing CDSA control with ordinary control. With CDSA, the agent's action decisions stay closer to the regions covered by the data distribution.
  • Figure 3: Experiments in the Risky PointMass environment are depicted in (a), where the red circle represents the risky zone, leading to negative rewards if occupied. The agent begins at the blue point, targeting the purple circle. In (b) and (c), we employ simple shortest path finding and CQL as baseline algorithms, demonstrating the corrective impact of our method. CDSA learns from an offline dataset generated by a pretrained CODAC agent, following the identical training procedure outlined in its official repository (Ma et al., 2021). Maroon trajectories illustrate baseline algorithm results, while black trajectories depict agents equipped with CDSA. Our method displays two sets of direction arrows: green indicating baseline algorithm directions, and blue indicating conservative auxiliary action directions from CDSA. The arrow length signifies action magnitude. With CDSA modifications, the agent effectively avoids the risky region.
  • Figure 4: VaR results (the $n^{\text{th}}$ percentile of the sorted cumulative reward). Only the results of CDSA (IQL) (blue lines) and IQL (green lines) on the D4RL benchmark are shown here. The results of POR and CDSA (POR) are shown in Appendix B.1.
  • Figure 5: Results of the risky transportation experiment. (a) shows the map of the environment, with the start point and the target point marked; the agent's trajectories are drawn as lines whose saturation increases with the number of steps. River, mountain, and ice areas are risky regions. (b) shows the states in the offline dataset: black indicates $\sum_{a} p_{\text{data}}(s,a)=0$ and white indicates that $\sum_{a} p_{\text{data}}(s,a)$ is non-zero. (c) is the gradient field of states learned by CDSA; only the state gradient field is shown because the action gradient field is hard to visualize. (d) shows the results of SAC and CDSA (SAC). After CDSA is employed to keep the agent within the known region, the agent exhibits a higher degree of risk aversion. (e) is a task that requires the agent to bring goods to the target point, with the region where the goods are placed marked on the map. (f) shows the results after an airport area is added to this environment. In these two tasks, we use the CDSA models learned from the path-finding task without any fine-tuning. The agent avoids all risky regions after using CDSA.
  • ...and 2 more figures
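
The VaR measure in the Figure 4 caption, the $n^{\text{th}}$ percentile of the sorted cumulative reward, can be computed from a list of episode returns. The sketch below assumes the nearest-rank percentile convention; the paper may use a different interpolation rule. The function name `value_at_risk` and the sample returns are hypothetical.

```python
import math


def value_at_risk(episode_returns, percentile):
    """Nearest-rank percentile of episode returns, read from the low end.

    A higher value at a small percentile means the worst episodes are less
    bad, i.e. the agent behaves in a more risk-averse way.
    """
    if not episode_returns:
        raise ValueError("episode_returns must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(episode_returns)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical cumulative rewards from ten evaluation episodes.
returns = [10, 50, 20, 40, 30, 90, 70, 60, 80, 100]

var_10 = value_at_risk(returns, 10)  # lowest 10% boundary
var_50 = value_at_risk(returns, 50)  # median under nearest-rank
print(var_10, var_50)  # prints "10 50"
```

Comparing this value for a baseline and for its CDSA-equipped variant at the same percentile gives the kind of comparison that the caption describes.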

Theorems & Definitions (1)

  • Lemma 1