DIAR: Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation

Jaehyun Park, Yunho Kim, Sejin Kim, Byung-Jun Lee, Sundong Kim

TL;DR

This work proposes a novel offline reinforcement learning approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework, and addresses Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model.

Abstract

We propose a novel offline reinforcement learning (offline RL) approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework. We address two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate value functions for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-term decision-making. Furthermore, we address Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. As demonstrated in tasks like Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms in long-horizon, sparse-reward environments.

Paper Structure

This paper contains 26 sections, 12 equations, 7 figures, 6 tables, 4 algorithms.

Figures (7)

  • Figure 1: Performance comparison across D4RL environments with long-horizon and sparse-reward tasks, specifically Maze2D. Our method, DIAR, consistently outperforms other diffusion-based planning frameworks, including Diffuser and LDCQ.
  • Figure 2: DIAR-generated trajectories in challenging Maze2D situations. DIAR reliably reaches the goal even from starting points (blue) that are far from the goal (red). DIAR shows strong performance regardless of starting position.
  • Figure 3: Three training stages of DIAR. (a) The $\beta$-VAE is trained by encoding a state-action sequence spanning an $\mathit{H}$-length horizon into a latent space, followed by a policy decoder that outputs actions based on the encoded latent $\bm{z}$ and the state $\bm{s}_{t}$ contained within it. (b) A diffusion model is trained using the encoded latent and the initial state $\bm{s}_t$. (c) The Q-network is trained on the offline dataset, while the value network is trained on data generated by the diffusion model. This interplay allows the value function and Q-function to guide each other, enabling more balanced learning across both offline samples and generated data.
  • Figure 4: Inference step with DIAR. The current state $\bm{s}_t$ is fed into the diffusion model to sample candidate latent vectors. The latent vector $\bm{z}_t$ with the highest $Q(\bm{s}_t,\bm{z}_t)$ is selected as the best latent vector and decoded to generate the action $\bm{a}_t$. The future state $\bm{s}_{t+H}$ is also decoded so that the future value $V(\bm{s}_{t+H})$ can be calculated.
  • Figure 5: Finding a better trajectory with Adaptive Revaluation. A decision is made and an action taken based on the skill latent $\bm{z}_t$ with the highest $Q(\bm{s}_t,\bm{z}_t)$. The same latent $\bm{z}_t$ is used to predict the future state $\bm{s}_{t+H}$, whose value $V(\bm{s}_{t+H})$ is then calculated. (a) If $V(\bm{s}_{t}) > V(\bm{s}_{t+H})$, the decision is considered non-ideal and re-sampling is performed. (b) If $V(\bm{s}_{t+H}) \geq V(\bm{s}_{t})$, the decision is considered ideal and the action $\bm{a}_t$ decoded from $\bm{z}_t$ continues to be executed.
  • ...and 2 more figures
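
The inference step (Figure 4) and the Adaptive Revaluation check (Figure 5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The callables `sample_latents`, `q_fn`, `v_fn`, and `predict_future_state` are hypothetical stand-ins for the diffusion sampler, the Q-network, the value network, and the future-state decoder. The parameters `n_candidates` and `max_resamples` are also assumptions, as is falling back to the last selected latent once the re-sampling budget runs out; the captions do not specify any of these.

```python
import numpy as np


def select_latent(state, candidates, q_fn):
    """Return the candidate latent with the highest Q(state, z) (Figure 4)."""
    q_values = np.array([q_fn(state, z) for z in candidates])
    return candidates[int(np.argmax(q_values))]


def adaptive_revaluation(state, sample_latents, q_fn, v_fn, predict_future_state,
                         n_candidates=8, max_resamples=5):
    """Choose a latent, re-sampling while V(s_t) > V(s_{t+H}) (Figure 5).

    Returns (latent, accepted, resamples_used). The fallback to the last
    selected latent when the budget is exhausted is an assumption.
    """
    z = None
    for attempt in range(max_resamples + 1):
        # Hypothetical stand-in for drawing candidates from the diffusion model.
        candidates = sample_latents(state, n_candidates)
        z = select_latent(state, candidates, q_fn)
        future_state = predict_future_state(state, z)
        # Case (b): the future state is at least as valuable, so keep this latent.
        if v_fn(future_state) >= v_fn(state):
            return z, True, attempt
        # Case (a): the current state is more valuable, so re-sample.
    return z, False, max_resamples


# Toy 1-D example: states are positions, latents are displacements,
# and the value of a state is its negative distance to the goal.
goal = 10.0
toy_v = lambda s: -abs(goal - s)
toy_predict = lambda s, z: s + z
toy_q = lambda s, z: toy_v(toy_predict(s, z))

rng = np.random.default_rng(0)
toy_sampler = lambda s, n: list(rng.uniform(-1.0, 1.0, size=n))

latent, accepted, resamples = adaptive_revaluation(
    0.0, toy_sampler, toy_q, toy_v, toy_predict
)
print(latent, accepted, resamples)
```

In the toy setting, a latent is accepted only when it moves the state no farther from the goal, which mirrors the comparison between $V(\bm{s}_{t})$ and $V(\bm{s}_{t+H})$ described in the Figure 5 caption.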