Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu

TL;DR

This work formulates offline Reinforcement Learning as a two-stage optimization problem: first pretraining expressive generative policies on reward-free behavior datasets, then fine-tuning these policies to align with task-specific annotations like Q-values.

Abstract

Drawing upon recent advances in language model alignment, we formulate offline Reinforcement Learning as a two-stage optimization problem: first pretraining expressive generative policies on reward-free behavior datasets, then fine-tuning these policies to align with task-specific annotations like Q-values. This strategy allows us to leverage abundant and diverse behavior data to enhance generalization and enable rapid adaptation to downstream tasks using minimal annotations. In particular, we introduce Efficient Diffusion Alignment (EDA) for solving continuous control problems. EDA utilizes diffusion models for behavior modeling. However, unlike previous approaches, we represent diffusion policies as the derivative of a scalar neural network with respect to action inputs. This representation is critical because it enables direct density calculation for diffusion models, making them compatible with existing LLM alignment theories. During policy fine-tuning, we extend preference-based alignment methods like Direct Preference Optimization (DPO) to align diffusion behaviors with continuous Q-functions. Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95% of performance and still outperforms several baselines given only 1% of Q-labelled data during fine-tuning.
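
To make the "derivative of a scalar neural network with respect to action inputs" idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's architecture: the one-hidden-layer network, its sizes, and the names `scalar_net` and `action_gradient` are hypothetical, and the conditioning on state and diffusion time that a real diffusion policy would need is omitted. The sketch shows that a single scalar output yields both a density-like score value and, through its action gradient, a vector field of the same dimension as the action.

```python
import numpy as np

def scalar_net(a, W1, b1, w2):
    """Hypothetical toy scalar network f(a) = w2 . tanh(W1 a + b1)."""
    return float(w2 @ np.tanh(W1 @ a + b1))

def action_gradient(a, W1, b1, w2):
    """Analytic gradient of scalar_net with respect to the action a.

    In the setting described above, a gradient of this kind plays the role
    of the diffusion model output (assumption: state/time inputs omitted).
    """
    h = np.tanh(W1 @ a + b1)
    # Chain rule: d/da [w2 . tanh(z)] = W1^T (w2 * (1 - tanh(z)^2))
    return W1.T @ (w2 * (1.0 - h ** 2))

rng = np.random.default_rng(0)
action_dim, hidden = 3, 16
W1 = rng.normal(size=(hidden, action_dim))
b1 = rng.normal(size=hidden)
w2 = rng.normal(size=hidden)
a = rng.normal(size=action_dim)

grad = action_gradient(a, W1, b1, w2)

# Sanity check against central finite differences.
eps = 1e-6
fd = np.array([
    (scalar_net(a + eps * e, W1, b1, w2) - scalar_net(a - eps * e, W1, b1, w2)) / (2 * eps)
    for e in np.eye(action_dim)
])
max_err = float(np.max(np.abs(grad - fd)))
```

In practice an automatic differentiation library would compute this gradient; the analytic form is used here only to keep the sketch dependency-free.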

Paper Structure (23 sections, 5 theorems, 44 equations, 11 figures, 2 tables)

Key Result

Proposition 3.1

(Proof in Appendix \ref{appendix:proof}.) Let $f_\theta^*$ be the optimal solution of Problem \ref{eq:CEP_DPO_loss} and $\pi_{t, \theta}^* \propto e^{f_\theta^*}$ be the optimal diffusion policy. Assuming unlimited model capacity and data samples, we have the following results: (a) Optimality Guarantee. At time … (b) Diffusion Consistency. At time $t>0$, $\pi_{t>0, \theta}$ models the diffused distribution of …

Figures (11)

  • Figure 1: Comparison between alignment strategies for LLMs and diffusion policies (ours).
  • Figure 2: Algorithm overview. Left: In behavior pretraining, the diffusion behavior model is represented as the derivative of a scalar neural network with respect to action inputs. The scalar outputs of the network can later be utilized to estimate behavior density. Right: In policy fine-tuning, we predict the optimality of actions in a contrastive manner among $K$ candidates. The prediction logit for each action is the density gap between the learned policy model and the frozen behavior model. We use cross-entropy loss to align prediction logits $\triangle f_\theta := f_\theta^\pi- f_\theta^\mu$ with dataset Q-labels.
  • Figure 3: Experimental results of EDA in 2D bandit settings at different diffusion times. Column 1: Visualization of diversified behavior datasets. Each dot represents a two-dimensional behavioral action. Its color reflects the action's Q-value. Columns 2 & 3: Density maps of the action distribution as estimated by the pretrained or fine-tuned BDM models. The density for low-Q-value actions has been effectively decreased after fine-tuning. Column 4: The predicted action Q-values, calculated by Eq. \ref{eq:modelQ}, align with dataset Q-values in Column 1. See Appendix \ref{appendix:toy_more} for complete results.
  • Figure 4: Average performance of EDA combined with different Q-learning methods in Locomotion tasks.
  • Figure 5: Aligning pretrained diffusion behaviors with task Q-functions is fast and data-efficient.
  • ...and 6 more figures
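
The fine-tuning step described in the Figure 2 caption (cross-entropy between prediction logits $f_\theta^\pi - f_\theta^\mu$ and dataset Q-labels over $K$ candidate actions) can be sketched as follows. This is a hedged illustration, not the paper's exact objective: turning Q-values into soft labels via a softmax with temperature `beta` is an assumption made here, and the function name `contrastive_alignment_loss` and the example numbers are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def contrastive_alignment_loss(f_pi, f_mu, q_values, beta=1.0):
    """Cross-entropy between soft Q-labels and density-gap logits.

    f_pi, f_mu : shape (K,), scalar-network outputs of the learned policy
                 and the frozen behavior model for K candidate actions.
    q_values   : shape (K,), dataset Q-labels for the same actions.
    beta       : assumed temperature used to turn Q-values into soft labels.
    """
    logits = f_pi - f_mu                              # density gap per candidate
    targets = np.exp(log_softmax(beta * q_values))    # soft labels (assumption)
    return float(-np.sum(targets * log_softmax(logits)))

# K = 4 candidate actions for one state (made-up numbers).
q = np.array([1.0, 0.5, -0.2, 2.0])
f_mu = np.array([0.3, -0.1, 0.0, 0.2])

# Policy identical to the behavior model: logits are all zero.
loss_unaligned = contrastive_alignment_loss(f_mu.copy(), f_mu, q)

# Policy whose density gap equals the Q-values: logits match the labels.
loss_aligned = contrastive_alignment_loss(f_mu + q, f_mu, q)
```

When the policy equals the behavior model, the logits are uniform and the loss equals $\log K$; when the density gap matches the scaled Q-values, the loss drops to the entropy of the soft labels, which is its minimum under this formulation.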

Theorems & Definitions (9)

  • Proposition 3.1
  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Lemma C.3
  • proof
  • Proposition C.4
  • proof