FDPP: Fine-tune Diffusion Policy with Human Preference

Yuxin Chen; Devesh K. Jha; Masayoshi Tomizuka; Diego Romeres

TL;DR

Fine-tuning Diffusion Policy with Human Preference (FDPP) learns a reward function through preference-based learning and uses it to fine-tune a pre-trained diffusion policy with reinforcement learning (RL), aligning the policy with new human preferences while still solving the original task.

Abstract

Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently seen considerable success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), aligning the pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.
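The two ingredients named in the abstract, preference-based reward learning and KL-regularized fine-tuning, can be illustrated with a minimal sketch. It assumes a Bradley-Terry preference model and a sample-based KL estimate, both common choices in this literature; the abstract does not specify the paper's exact losses, and for a diffusion policy the log-probabilities would in practice be accumulated over denoising steps. The function names and the weight `beta` are hypothetical.

```python
import numpy as np


def preference_loss(r_a, r_b, label):
    """Bradley-Terry cross-entropy loss (assumed model, not confirmed by the source).

    r_a, r_b: predicted rewards for the two items of each labeled pair.
    label:    1.0 where the human preferred item A, 0.0 where B was preferred.
    """
    r_a, r_b, label = (np.asarray(x, dtype=float) for x in (r_a, r_b, label))
    logits = r_a - r_b
    # log(sigmoid(z)) = -logaddexp(0, -z), which is numerically stable.
    log_p_a = -np.logaddexp(0.0, -logits)
    log_p_b = -np.logaddexp(0.0, logits)
    return float(np.mean(-(label * log_p_a + (1.0 - label) * log_p_b)))


def kl_penalized_objective(rewards, logp_finetuned, logp_pretrained, beta=0.1):
    """Mean learned reward minus beta times a sample estimate of the KL term.

    With actions sampled from the fine-tuned policy, the mean of
    (logp_finetuned - logp_pretrained) estimates KL(finetuned || pretrained).
    beta is a hypothetical regularization weight.
    """
    rewards = np.asarray(rewards, dtype=float)
    kl_estimate = np.mean(
        np.asarray(logp_finetuned, dtype=float)
        - np.asarray(logp_pretrained, dtype=float)
    )
    return float(np.mean(rewards) - beta * kl_estimate)
```

Under these assumptions, a larger `beta` keeps the fine-tuned policy closer to the pre-trained one, which is the mechanism the abstract credits for preserving the original task competence.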

Paper Structure (21 sections, 16 equations, 23 figures, 2 tables)

Introduction
Related Work
Diffusion Policy
Preference-based Reward Learning
RL-based Fine-tuning of Diffusion Models
Preliminaries
Markov Decision Process and Reinforcement Learning
Diffusion Model and Diffusion Policy
Fine-tuning Diffusion Policy with Human Preference
Preference-based Reward Learning
RL-based Fine-tuning
KL Regularization
Implementation Details
Experimental Evaluation
Setup
...and 6 more sections

Figures (23)

Figure 1: Fine-tune Diffusion Policy with Human Preference. Given a pre-trained diffusion policy, FDPP collects trajectory roll-outs and queries human feedback to label pairs of randomly sampled image observations based on human preferences or task specifications. Using these labels, a reward function is trained through preference-based reward learning, which is then used to fine-tune the diffusion policy via reinforcement learning.
Figure 2: Environments for Evaluation. To evaluate the effectiveness of FDPP, we choose two long-horizon manipulation tasks: (left) a 2D pushing task, Push-T (Florence et al., 2022; Chi et al., 2023), and (right) a 3D pick-and-place task, Stacking, from MimicGen (Mandlekar et al., 2023).
Figure : (a) Additional Constraints
Figure : Push-T
Figure : Pre-trained
...and 18 more figures
