FDPP: Fine-tune Diffusion Policy with Human Preference

Yuxin Chen, Devesh K. Jha, Masayoshi Tomizuka, Diego Romeres

TL;DR

Fine-tuning Diffusion Policy with Human Preference (FDPP) learns a reward function through preference-based learning, then uses that reward to fine-tune the pre-trained policy with reinforcement learning (RL). The result is a policy aligned with new human preferences that still solves the original task.

Abstract

Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), aligning the pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.
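
To make the preference-based reward learning step more concrete, the sketch below shows a Bradley-Terry style loss, a common choice in preference-based reward learning. The abstract does not state which loss FDPP uses, so treat this as an assumption. The function names `preference_probability` and `preference_loss` are hypothetical, and the scalar rewards stand in for the outputs of a learned reward model on two observations.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that observation A is preferred over B.

    Assumption: preferences follow a logistic model of the reward gap.
    """
    diff = reward_a - reward_b
    # Numerically stable sigmoid of the reward difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(reward_a: float, reward_b: float, label: float) -> float:
    """Binary cross-entropy against a human label.

    label = 1.0 if the human preferred A, 0.0 if the human preferred B.
    """
    p = preference_probability(reward_a, reward_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

Minimizing this loss over labeled pairs pushes the reward model to score the preferred observation higher. In a real pipeline the rewards would come from a neural network and the loss would be averaged over a batch of labeled pairs.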

Paper Structure (21 sections, 16 equations, 23 figures, 2 tables)

Figures (23)

  • Figure 1: Fine-tune Diffusion Policy with Human Preference. Given a pre-trained diffusion policy, FDPP collects trajectory roll-outs and queries human feedback to label pairs of randomly sampled image observations based on human preferences or task specifications. Using these labels, a reward function is trained through preference-based reward learning, which is then used to fine-tune the diffusion policy via reinforcement learning.
  • Figure 2: Environments for Evaluation. To evaluate the effectiveness of FDPP, we choose two long-horizon manipulation tasks: (left) a 2D pushing task, Push-T [florence2022implicit, chi2023diffusion], and (right) a 3D pick-and-place task, Stacking, from MimicGen [mandlekar2023mimicgen].
  • Figure : (a) Additional Constraints
  • Figure : Push-T
  • Figure : Pre-trained
  • ...and 18 more figures