REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning

Zhaoyuan Gu; Yipu Chen; Zimeng Chai; Alfred Cueva; Thong Nguyen; Yifan Wu; Huishu Xue; Minji Kim; Isaac Legene; Fukang Liu; Matthew Kim; Ayan Barula; Yongxin Chen; Ye Zhao

Abstract

Humanoid loco-manipulation requires coordinating high-level motion plans with stable low-level whole-body execution under complex robot-environment dynamics and over long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high-dimensional humanoid systems. To address this challenge, we present REFINE-DP (REinforcement learning FINE-tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high-level planner and an RL-based low-level loco-manipulation controller. The DP is fine-tuned via a PPO-based diffusion policy gradient to improve the task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE-DP on a humanoid robot performing loco-manipulation tasks, including door traversal and long-horizon object transport. REFINE-DP achieves a success rate of over $90\%$ in simulation, even in out-of-distribution cases not seen in the pre-training data, and enables smooth autonomous task execution in real-world dynamic environments. Our proposed method substantially outperforms pre-trained DP baselines and demonstrates that RL fine-tuning is key to reliable humanoid loco-manipulation. https://refine-dp.github.io/REFINE-DP/
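
As an illustration of the "PPO-based" part of the fine-tuning objective, the following is a minimal sketch of the standard PPO clipped surrogate, not the paper's implementation. The abstract does not specify how action log-likelihoods are obtained from the diffusion policy (for example, per denoising step), so the function simply takes them as inputs. The function name `ppo_clipped_surrogate`, its arguments, and the default `clip_eps=0.2` are assumptions made for this example.

```python
import math


def ppo_clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    new_logp / old_logp: log-likelihoods of the sampled actions under the
    current and the data-collecting policy. How these are computed for a
    diffusion policy is NOT given in the abstract; they are assumed inputs.
    advantages: advantage estimates for the same samples.
    """
    n = len(advantages)
    if n == 0 or len(new_logp) != n or len(old_logp) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Importance ratio between the updated and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / n
```

The clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is the usual reason PPO is chosen when fine-tuning a pre-trained policy.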

Paper Structure (16 sections, 9 equations, 7 figures, 1 algorithm)

Introduction
Related Works
Sim-to-Real RL for Humanoid Loco-manipulation Control
Autonomous Humanoid Loco-Manipulation
Fine-tuning Pre-trained Imitation Learning Policy
Methods
Training Loco-manipulation Controller for Data Collection
Diffusion Policy Pre-training
Diffusion Policy Fine-tuning
Joint Optimization of the Diffusion and the RL Policies
Experiments
Experiment Setup
Baseline Methods and Ablation Study
Quantitative Results and Analysis
Hardware Experiment
...and 1 more section

Figures (7)

Figure 1: Our pipeline consists of three stages: (a) Data Collection, where expert demonstrations are collected in a source environment using a frozen RL-based loco-manipulation controller; (b) Pre-training, where a diffusion policy is trained with the expert dataset containing human-demonstrated skills; and (c) Joint Optimization, where the pre-trained diffusion policy is jointly fine-tuned with the low-level controller in a target environment, enabling the robot to refine its loco-manipulation skills through RL to achieve a higher success rate and better motion tracking quality.
Figure 2: The sim-to-real RL of the loco-manipulation control policy, from motion prior generation to deployment.
Figure 3: The robot autonomously executes loco-manipulation tasks: (a) door opening, (b) object transportation.
Figure 4: Comparison of success rates across the tasks in Sec. \ref{sec:experiment} and the methods in Sec. \ref{sec:baseline}.
Figure 5: Joint optimization achieves better end-effector tracking performance and velocities across the tasks in Sec. \ref{sec:experiment}.
...and 2 more figures
