AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners

Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, Ping Luo

TL;DR

AdaptDiffuser is a self-evolving diffusion-based planner for offline RL. It generates diverse, high-quality synthetic demonstrations guided by reward gradients, filters them with a discriminator, and fine-tunes the diffusion model on the selected data, improving planning on seen tasks and generalization to unseen tasks without extra expert data. By incorporating reward-to-go conditioning and dynamics-consistency constraints, it outperforms the prior diffusion planner Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, and demonstrates zero-shot adaptation to new tasks like KUKA pick-and-place. The work includes ablations on iterative data generation, data sufficiency, and model size, and discusses practical considerations such as training-time cost and possible extensions to high-dimensional observations and diverse maze layouts. Overall, AdaptDiffuser offers a framework for adaptive, task-general diffusion-based planning in offline settings, with relevance to autonomous robots and goal-conditioned control.

Abstract

Diffusion models have demonstrated powerful generative capability in many tasks and have great potential to serve as a paradigm for offline reinforcement learning. However, the quality of a diffusion model is limited by insufficient diversity in its training data, which hinders planning performance and generalization to new tasks. This paper introduces AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model, and hence the planner, not only on seen tasks but also on unseen tasks. AdaptDiffuser generates rich synthetic expert data for goal-conditioned tasks using guidance from reward gradients. It then selects high-quality data via a discriminator to fine-tune the diffusion model, which improves generalization to unseen tasks. Empirical experiments on two benchmark environments and two carefully designed unseen tasks in the KUKA industrial robot arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For example, AdaptDiffuser not only outperforms the previous state of the art, Diffuser, by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data. More visualization results and demo videos can be found on the project page.
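
The generate, filter, fine-tune cycle described in the abstract can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "diffusion model" is reduced to a single mean vector, a "trajectory" is a 2-D point, the discriminator is a plain reward threshold, and the names self_evolve, reward_grad, and accepts are hypothetical rather than taken from the paper's code.

```python
import random

def self_evolve(model_mean, reward_grad, accepts, rng,
                n_rounds=3, n_candidates=64, noise=0.3, guide_scale=0.1):
    """Toy self-evolving loop: generate, filter, fine-tune, repeat.

    model_mean is a hypothetical stand-in for the diffusion model
    (a single mean vector, not a denoising U-Net).
    """
    pool = []
    for _round in range(n_rounds):
        kept = []
        for _ in range(n_candidates):
            # 1. Generate: sample near the current model, then nudge the
            #    sample along the reward gradient of the target task.
            x = [m + rng.gauss(0.0, noise) for m in model_mean]
            x = [xi + guide_scale * gi for xi, gi in zip(x, reward_grad(x))]
            # 2. Filter: keep only candidates the discriminator accepts.
            if accepts(x):
                kept.append(x)
        pool.extend(kept)
        # 3. "Fine-tune": move the toy model halfway toward the accepted data.
        if kept:
            centroid = [sum(col) / len(kept) for col in zip(*kept)]
            model_mean = [0.5 * m + 0.5 * c for m, c in zip(model_mean, centroid)]
    return model_mean, pool

# Toy goal-conditioned task: reward is negative squared distance to a goal.
goal = [1.0, 2.0]

def reward(x):
    return -sum((xi - gi) ** 2 for xi, gi in zip(x, goal))

def reward_grad(x):
    return [-2.0 * (xi - gi) for xi, gi in zip(x, goal)]

def accepts(x):
    # Hypothetical discriminator: a fixed reward threshold.
    return reward(x) > -3.5

final_mean, pool = self_evolve([0.0, 0.0], reward_grad, accepts, random.Random(0))
```

In this toy setting the model mean drifts toward the goal over the rounds, because only accepted samples feed the update. The real method fine-tunes a diffusion model on full trajectories rather than averaging points.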

Paper Structure (39 sections, 3 theorems, 23 equations, 11 figures, 12 tables)

This paper contains 39 sections, 3 theorems, 23 equations, 11 figures, 12 tables.

Key Result

Lemma 1.1

The marginal probability of a conditional Markov noising process $q$ conditioned on $y$ is equal to the marginal probability of the unconditional noising process.
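
One way to make the lemma concrete uses generic notation that is an assumption here, not necessarily the paper's own. Write $q$ for the unconditional noising process over noised samples $x_t$, and $\hat q$ for the process conditioned on $y$, defined so that each noising step ignores $y$: $\hat q(x_{t+1} \mid x_t, y) := q(x_{t+1} \mid x_t)$. Marginalizing out $y$ then recovers the unconditional transition:

$$
\hat q(x_{t+1} \mid x_t)
= \int \hat q(x_{t+1} \mid x_t, y)\, \hat q(y \mid x_t)\, dy
= q(x_{t+1} \mid x_t) \int \hat q(y \mid x_t)\, dy
= q(x_{t+1} \mid x_t).
$$

If the two processes also share the same initial distribution, $\hat q(x_0) = q(x_0)$, then the marginals $\hat q(x_t)$ and $q(x_t)$ agree for every $t$ by induction on $t$.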

Figures (11)

  • Figure 1: Overall framework and performance comparison of AdaptDiffuser. It enables diffusion models to generate rich synthetic expert data using guidance from reward gradients of either seen or unseen goal-conditioned tasks. Then, it iteratively selects high-quality data via a discriminator to finetune the diffusion model for self-evolving, leading to improved performance on seen tasks and better generalizability to unseen tasks.
  • Figure 2: Overall framework of AdaptDiffuser. To improve the adaptability of the diffusion model to diverse tasks, rich data with distinct objectives is generated, guided by each task’s reward function. During the diffusion denoising process, we utilize a pre-trained denoising U-Net to progressively generate high-quality trajectories. At each denoising time step, we use the gradient of the task-specific trajectory reward to adjust the state and action sequence, thereby creating trajectories that align with specific task objectives. Subsequently, the generated synthetic trajectory is evaluated by a discriminator to see if it meets the standards. If it does, it is incorporated into a data pool used to fine-tune the diffusion model. The procedure iteratively enhances the generalizability of our model for both seen and unseen settings.
  • Figure 3: Hard Cases of Maze2D with Long Planning Path. Paths are generated in the Maze2D environment with a specified start and goal condition.
  • Figure 4: Maze2D Navigation with Gold Coin Picking Task. Subfigures (a) and (b) show the optimal path when there are no gold coins in the maze; the generated routes run along the bottom of the maze. Subfigures (c) and (d) add an additional reward at position (4,2) of the maze; the planners then generate new paths that pass through the gold coin, running through the middle of the maze.
  • Figure 5: Visualization of KUKA Pick-and-Place Task. We require the KUKA Arm to move the blocks from their randomly initialized positions on the right side of the table to the left and arrange them in the order of yellow, blue, green, and red (from near to far).
  • ...and 6 more figures
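
The reward-guided denoising described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: denoise_mean is a hypothetical stand-in for the pre-trained U-Net, the noise schedule sigmas is invented, and the mean shift (guide_scale times step variance times reward gradient) follows the general pattern of gradient-guided sampling.

```python
import random

def guided_denoise(denoise_mean, reward_grad, sigmas, rng, dim, guide_scale=1.0):
    """Toy reverse process with reward-gradient guidance.

    denoise_mean: hypothetical stand-in for the pre-trained denoiser; maps the
                  current sample to the mean of the next, less noisy sample.
    reward_grad:  gradient of the task reward with respect to the sample.
    sigmas:       per-step noise standard deviations, largest first.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for sigma in sigmas:
        mu = denoise_mean(x)
        grad = reward_grad(mu)
        # Shift the mean along the reward gradient (scaled by the step
        # variance), then add this step's noise.
        x = [m + guide_scale * sigma ** 2 * g + sigma * rng.gauss(0.0, 1.0)
             for m, g in zip(mu, grad)]
    return x

# Toy task: the "model" shrinks samples toward the origin, and the reward
# r(x) = x[0] prefers a larger first coordinate.
shrink = lambda x: [0.5 * xi for xi in x]
grad_r = lambda x: [1.0, 0.0]
sigmas = [1.0, 0.5, 0.25, 0.1, 0.0]

plain = guided_denoise(shrink, grad_r, sigmas, random.Random(0), dim=2, guide_scale=0.0)
guided = guided_denoise(shrink, grad_r, sigmas, random.Random(0), dim=2, guide_scale=1.0)
```

With the same random seed, the guided run differs from the unguided one only by the accumulated gradient shifts. Its first coordinate ends up larger while the second is unchanged, a small-scale version of steering generated trajectories toward a task objective.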

Theorems & Definitions (6)

  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Theorem 1.3
  • proof