Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners

Wen Zheng Terence Ng, Jianda Chen, Yuan Xu, Tianwei Zhang

TL;DR

This work personalizes diffusion-planner trajectories for individual users by separating learning into a reward-free pretraining stage and a rapid, low-dimensional adaptation stage. Preference Latent Embeddings (PLEs, $z$) are learned alongside a diffusion model and later aligned to user preferences via a lightweight preference inversion process that optimizes $z$ with minimal labeled data while the base model remains frozen. Preference-aligned trajectories are then generated by a sampling scheme that leverages winner/loser embeddings, enabling efficient and stable customization. Empirical results on offline benchmarks and a real-human preference study show better alignment with human preferences than RLHF and LoRA baselines, with practical data efficiency and potential implications for edge deployment.

Abstract

This work addresses the challenge of personalizing trajectories generated in automated decision-making systems by introducing a resource-efficient approach that enables rapid adaptation to individual users' preferences. Our method leverages a pretrained conditional diffusion model with Preference Latent Embeddings (PLE), trained on a large, reward-free offline dataset. The PLE serves as a compact representation for capturing specific user preferences. By adapting the pretrained model using our proposed preference inversion method, which directly optimizes the learnable PLE, we achieve superior alignment with human preferences compared to existing solutions like Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA). To better reflect practical applications, we create a benchmark experiment using real human preferences on diverse, high-reward trajectories.
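The core idea of adapting only a small embedding while the pretrained model stays frozen can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the frozen diffusion model is replaced by fixed trajectory feature vectors, the score of a trajectory under an embedding is a plain dot product, and the preference loss is a Bradley-Terry style logistic loss. The names `invert_preferences`, `preference_loss`, and `hidden_user` are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def preference_loss(z, pairs):
    """Mean of -log sigmoid(score(winner) - score(loser)) (assumed loss)."""
    total = 0.0
    for win, lose in pairs:
        margin = dot(z, win) - dot(z, lose)
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


def invert_preferences(pairs, dim, steps=200, lr=0.5):
    """Gradient descent on the embedding z only; the features stay fixed."""
    z = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for win, lose in pairs:
            margin = dot(z, win) - dot(z, lose)
            # Derivative of -log sigmoid(margin) with respect to margin.
            coeff = -(1.0 - sigmoid(margin))
            for i in range(dim):
                grad[i] += coeff * (win[i] - lose[i]) / len(pairs)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


# Toy data: a hidden user preference vector labels each query pair.
rng = random.Random(0)
dim = 4
hidden_user = [1.0, -1.0, 0.5, 0.0]  # hypothetical ground-truth preference
pairs = []
for _ in range(20):
    a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Store each pair as (winner, loser) according to the hidden preference.
    pairs.append((a, b) if dot(hidden_user, a) >= dot(hidden_user, b) else (b, a))

initial_loss = preference_loss([0.0] * dim, pairs)
z = invert_preferences(pairs, dim)
final_loss = preference_loss(z, pairs)
accuracy = sum(dot(z, w) > dot(z, l) for w, l in pairs) / len(pairs)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, pair accuracy {accuracy:.2f}")
```

Only the four entries of `z` are updated here, which mirrors the low-dimensional adaptation described above. A real implementation would instead backpropagate through the frozen conditional diffusion model.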

Paper Structure

This paper contains 12 sections, 5 equations, 7 figures.

Figures (7)

  • Figure 1: Overview of personalizing decision-making models. We leverage large-scale offline data for pretraining, followed by rapid and efficient personalization using small-scale preference data.
  • Figure 2: Overview of the proposed method. (Left) Pretraining: A placeholder for the preference latent embedding (PLE), $z$, is co-trained with the diffusion model, without reward supervision. (Middle) Adaptation: With the diffusion model weights frozen, PLEs are aligned to user-labeled query pairs via preference inversion. (Right) Generation: Conditional sampling with the learned PLEs generates trajectories that match the user's preferences.
  • Figure 3: Latent space analysis: We visualize t-SNE plots of PLEs post-pretraining, where each point represents a trajectory, and color intensity reflects its normalized score. The smooth gradient in return distribution indicates that our pretraining effectively structures the PLE space.
  • Figure 4: Main results: normalized scores on six control tasks, evaluated over different numbers of queries.
  • Figure 5: A series of ablation experiments. The average normalized score is reported across all tasks and $N_\text{query}$, except for the loser PLE analysis, where averaging is performed across tasks only.
  • ...and 2 more figures