Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang, Yuze Zhai, Xiaocong Li

TL;DR

Offline-to-online reinforcement learning for robotics suffers from limited offline multimodal coverage and distribution shifts during online adaptation. UEPO addresses this by integrating a state-conditioned diffusion policy with multi-seed sampling to capture diverse action sequences, a dynamic divergence constraint guided by diffusion sampling, and diffusion-based data augmentation to train the dynamics model on a larger, physically consistent dataset, formalized with $p(a_{1:T}|s_{1:T})$ and a sequence-level objective $J(\nabla\pi)$. It introduces a velocity/acceleration-based divergence metric $\text{div}(a_i,a_j)$ with adaptive perturbation and a sequence-level KL regularizer to promote both local and global diversity. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion and +12.4% on dexterous manipulation, demonstrating strong generalization, stability, and data efficiency for robust robot learning $\left( p(a_{1:T}|s_{1:T})\,,\, \hat{T}(s'|s,a)\right)$. This framework offers practical impact by reducing reliance on expert demonstrations while enabling scalable, safe deployment of learned policies in dynamic real-world settings.
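The summary names a velocity/acceleration-based divergence $\text{div}(a_i,a_j)$ but does not give its formula. The following is only a minimal sketch of what such a metric could look like, assuming velocity and acceleration are approximated by first and second finite differences of the action sequence and that the two terms are combined with hypothetical weights `w_vel` and `w_acc`. The helper names `finite_diff` and `mean_sq_gap` are illustrative and do not come from the paper.

```python
from typing import List

Sequence2D = List[List[float]]  # shape: [time step][action dimension]


def finite_diff(seq: Sequence2D) -> Sequence2D:
    """First-order difference along the time axis (one row shorter than seq)."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])]


def mean_sq_gap(x: Sequence2D, y: Sequence2D) -> float:
    """Mean squared element-wise gap between two equally shaped sequences."""
    total, count = 0.0, 0
    for row_x, row_y in zip(x, y):
        for u, v in zip(row_x, row_y):
            total += (u - v) ** 2
            count += 1
    return total / count if count else 0.0


def div(a_i: Sequence2D, a_j: Sequence2D,
        w_vel: float = 1.0, w_acc: float = 1.0) -> float:
    """Assumed divergence: weighted gap between velocities and accelerations.

    This is an illustrative definition, not the paper's formula.
    """
    if len(a_i) != len(a_j):
        raise ValueError("action sequences must have the same horizon")
    vel_i, vel_j = finite_diff(a_i), finite_diff(a_j)
    acc_i, acc_j = finite_diff(vel_i), finite_diff(vel_j)
    return w_vel * mean_sq_gap(vel_i, vel_j) + w_acc * mean_sq_gap(acc_i, acc_j)


# Two sequences that differ only by a constant offset move identically.
slow = [[0.0], [1.0], [2.0], [3.0]]
shifted = [[5.0], [6.0], [7.0], [8.0]]
# A sequence that moves twice as fast has a different velocity profile.
fast = [[0.0], [2.0], [4.0], [6.0]]

same_motion = div(slow, shifted)
different_motion = div(slow, fast)
```

Under this assumed definition, a constant offset between two action sequences contributes nothing, because only the rate of change is compared. That is one plausible reading of a diversity measure tied to motion rather than to raw action values.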

Abstract

Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
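The idea behind contribution (1), drawing diverse behaviors from one model by varying the sampling seed instead of training several models, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_denoise_step` is a hypothetical stand-in for a trained, state-conditioned denoiser (it simply pulls each action toward the nearer of two modes, `+state` or `-state`), and `sample_actions` and `multi_seed_sample` are illustrative names, not the paper's API.

```python
import random
from typing import Callable, Dict, List

DenoiseStep = Callable[[List[float], float, int], List[float]]


def toy_denoise_step(actions: List[float], state: float, k: int) -> List[float]:
    """Hypothetical denoiser: move each action halfway toward its nearer mode.

    The two modes are +state and -state; the step index k is unused here.
    A real diffusion policy would use a learned network instead.
    """
    out = []
    for a in actions:
        target = state if a >= 0 else -state
        out.append(a + 0.5 * (target - a))
    return out


def sample_actions(denoise_step: DenoiseStep, state: float, horizon: int,
                   num_steps: int, seed: int) -> List[float]:
    """Start from seeded Gaussian noise and apply the denoiser num_steps times."""
    rng = random.Random(seed)
    actions = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    for k in reversed(range(num_steps)):
        actions = denoise_step(actions, state, k)
    return actions


def multi_seed_sample(denoise_step: DenoiseStep, state: float, horizon: int,
                      num_steps: int, seeds: List[int]) -> Dict[int, List[float]]:
    """Draw one action sequence per seed from the same shared denoiser."""
    return {
        seed: sample_actions(denoise_step, state, horizon, num_steps, seed)
        for seed in seeds
    }


samples = multi_seed_sample(toy_denoise_step, state=1.0, horizon=4,
                            num_steps=10, seeds=[0, 1, 2])
```

Each seed fixes the initial noise, so a given seed always reproduces the same sequence while different seeds can land in different modes. The single shared denoiser is what stands in for "without training multiple models" in this sketch.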

Paper Structure

This paper contains 9 sections, 2 equations, 1 figure, 1 table, 1 algorithm.

Figures (1)

  • Figure 1: UEPO employs a multi-seed diffusion sampling strategy to initialize components for the subsequent phase. During the offline optimization stage (middle), the strategy enhances diversity through a regularization mechanism that amplifies policy divergence, whilst simultaneously training the dynamics model $\hat{T}$ using a joint approach of real data and synthetic trajectories to improve its generalization capability. Finally, a qualifying policy is selected as the initialization for online fine-tuning.