Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning
Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang, Yuze Zhai, Xiaocong Li
TL;DR
Offline-to-online reinforcement learning for robotics suffers from limited coverage of multimodal behaviors in the offline data and from distribution shift during online adaptation. UEPO addresses both problems with three components: a state-conditioned diffusion policy with multi-seed sampling that captures diverse action sequences, a dynamic divergence constraint guided by diffusion sampling, and diffusion-based data augmentation that trains the dynamics model on a larger, physically consistent dataset. The approach is formalized with the action-sequence distribution $p(a_{1:T}|s_{1:T})$, a learned dynamics model $\hat{T}(s'|s,a)$, and a sequence-level objective $J(\nabla\pi)$. It introduces a velocity/acceleration-based divergence metric $\text{div}(a_i,a_j)$ with adaptive perturbation, together with a sequence-level KL regularizer, to promote both local and global diversity. On the D4RL benchmark, UEPO achieves a +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization, stability, and data efficiency. In practice, the framework reduces reliance on expert demonstrations while enabling scalable, safe deployment of learned policies in dynamic real-world settings.
Abstract
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment, but it suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shift during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves a +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
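
To make the idea of a velocity/acceleration-based divergence $\text{div}(a_i,a_j)$ between two action sequences more concrete, here is a minimal sketch in plain Python. The exact formula is not given in this summary, so everything below is an assumption made for illustration: velocities and accelerations are approximated by first and second finite differences along the time axis, and the divergence is a weighted sum of their mean squared gaps. The names `finite_difference`, `mean_squared_gap`, `divergence`, `w_vel`, and `w_acc` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

ActionSeq = Sequence[Sequence[float]]  # shape: (T, action_dim)


def finite_difference(seq: ActionSeq) -> List[List[float]]:
    """First-order difference along time: x[t+1] - x[t] for each dimension."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])]


def mean_squared_gap(xs: ActionSeq, ys: ActionSeq) -> float:
    """Squared Euclidean distance between xs[t] and ys[t], averaged over t."""
    if not xs:
        return 0.0
    total = sum(
        sum((x - y) ** 2 for x, y in zip(x_row, y_row))
        for x_row, y_row in zip(xs, ys)
    )
    return total / len(xs)


def divergence(a_i: ActionSeq, a_j: ActionSeq,
               w_vel: float = 1.0, w_acc: float = 1.0) -> float:
    """Illustrative div(a_i, a_j): weighted velocity gap plus acceleration gap.

    This is an assumed form, not the definition used by UEPO.
    """
    if len(a_i) != len(a_j):
        raise ValueError("action sequences must have the same length")
    vel_i, vel_j = finite_difference(a_i), finite_difference(a_j)
    acc_i, acc_j = finite_difference(vel_i), finite_difference(vel_j)
    return (w_vel * mean_squared_gap(vel_i, vel_j)
            + w_acc * mean_squared_gap(acc_i, acc_j))


slow = [[0.0], [1.0], [2.0], [3.0]]      # constant velocity 1
fast = [[0.0], [2.0], [4.0], [6.0]]      # constant velocity 2
shifted = [[5.0], [6.0], [7.0], [8.0]]   # same motion as `slow`, offset by 5

print(divergence(slow, fast))     # 1.0
print(divergence(slow, shifted))  # 0.0
```

Under this assumed form, two sequences that differ only by a constant offset have zero divergence, while sequences that move at different rates do not. That is one way to read "physically meaningful" diversity: the measure compares how the actions evolve over time, not their raw values.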
