Regularized Conditional Diffusion Model for Multi-Task Preference Alignment

Xudong Yu, Chenjia Bai, Haoran He, Changhong Wang, Xuelong Li

TL;DR

This work adopts multi-task preferences as a unified condition for both single- and multi-task decision-making, proposes preference representations aligned with preference labels, and introduces an auxiliary objective that maximizes the mutual information between the representations and the corresponding generated trajectories, improving alignment between trajectories and preferences.

Abstract

Sequential decision-making should align with human intent and remain versatile across tasks. Previous methods formulate it as a conditional generation process, using return-conditioned diffusion models to model trajectory distributions directly. However, the return-conditioned paradigm relies on pre-defined reward functions. It therefore struggles in multi-task settings, where reward functions vary across tasks (versatility), and offers limited controllability with respect to human preferences (alignment). In this work, we adopt multi-task preferences as a unified condition for both single- and multi-task decision-making, and propose preference representations aligned with preference labels. The learned representations guide the conditional generation process of diffusion models, and we introduce an auxiliary objective that maximizes the mutual information between the representations and the corresponding generated trajectories, improving alignment between trajectories and preferences. Extensive experiments on D4RL and Meta-World demonstrate that our method performs favorably in single- and multi-task scenarios and exhibits superior alignment with preferences.

Paper Structure (63 sections, 2 theorems, 16 equations, 7 figures, 7 tables, 1 algorithm)

Key Result

Proposition 3.1

The optimization objective in Equation eq:7 can be transformed to

Figures (7)

  • Figure 1: Illustration of return-conditioned generation with Decision Diffuser [decisiondiffuser] on the hopper-medium-expert task. Existing return-conditioned diffusion models fail to align generated trajectories with the return condition; the red line indicates the desired relationship between the return conditions and the true returns of generated trajectories [yuan2024reward].
  • Figure 2: Illustration of the representation space of trajectories in multi-task preference data. For each task $i$, the positive samples $\tau^+$ consist of preferred trajectories $\tau^{i+}$ from task $i$, while negative samples $\tau^-$ include less preferred $\tau^{i-}$ from the same task, as well as $\tau^j$ from other tasks. Trajectories from diverse tasks are expected to be differentiated in the representation space, and $\{w^*_i\}_{i\in[m]}$ attempts to characterize the best trajectories for each task.
  • Figure 3: Overview of our method. (1) We learn preference representations $w=f_\psi(\tau)$ and the optimal one $w^*_i$ from trajectory segments $\tau$, which comprise positive samples $\tau^+$ and negative samples $\tau^-$. (2) We augment the diffusion model with an auxiliary mutual information term $I(\tau_0;w)$ to ensure the alignment between $\tau_0$ and $w$. (3) During the inference process, the diffusion model conditioned on $w^*_i$ can generate desired trajectories aligned with preferences.
  • Figure 4: Average success rates in MT-10 benchmarks trained with different datasets. Orange bars are reward-based methods, while green bars represent preference-based methods. Detailed comparisons for each task can be found in the appendix (app:more_result).
  • Figure 5: Left: Brighter dots indicate trajectories with higher returns. Red dots represent each dimension of $w^*_i$. Black triangles in (b) mark trajectories with the highest return. $f_\psi$ can separate trajectories from different tasks and with different returns. $w^*_i$ aligns with the optimal trajectories for each task. Right: Guided by $w^*_i$, diffusion models can generate trajectories $\tau_0^*$ that mainly lie around $w^*_i$ (shown as black circles), which represents better trajectories in offline data $\tau_0$.
  • ...and 2 more figures
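
The positive/negative split described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration of the sampling rule stated there, not the authors' implementation: the `Segment` class, its `name` field, the `build_contrastive_sets` helper, and the per-segment `preferred` flag are all hypothetical. In particular, real preference data is usually given as pairwise comparisons, which this sketch simplifies to a single boolean per trajectory segment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical container for one trajectory segment."""
    task: int        # task index i
    name: str        # illustrative identifier only
    preferred: bool  # simplification: True if preferred within its own task


def build_contrastive_sets(
    segments: List[Segment], task: int
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments into positives and negatives for one task.

    Positives: preferred segments from the given task.
    Negatives: less preferred segments from the same task,
               plus every segment from any other task.
    """
    positives = [s for s in segments if s.task == task and s.preferred]
    negatives = [s for s in segments if s.task != task or not s.preferred]
    return positives, negatives


# Toy data: two tasks, each with one preferred and one less preferred segment.
segments = [
    Segment(task=0, name="a", preferred=True),
    Segment(task=0, name="b", preferred=False),
    Segment(task=1, name="c", preferred=True),
    Segment(task=1, name="d", preferred=False),
]

pos, neg = build_contrastive_sets(segments, task=0)
print([s.name for s in pos])  # ['a']
print([s.name for s in neg])  # ['b', 'c', 'd']
```

Note that the preferred segment "c" from task 1 still counts as a negative for task 0, which matches the caption's statement that trajectories from other tasks are treated as negatives so that tasks separate in the representation space.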

Theorems & Definitions (2)

  • Proposition 3.1
  • Lemma A.1