Dual Alignment Maximin Optimization for Offline Model-based RL

Chi Zhou; Wang Luo; Haoran Li; Congying Han; Tiande Guo; Zicheng Zhang

TL;DR

This work tackles distribution mismatch in offline model-based RL by focusing on the policy discrepancies that arise when learned dynamics diverge from the real environment. It introduces Dual Alignment Maximin Optimization (DAMO), a unified actor-critic framework in which an inner minimization performs dual conservative value estimation and an outer maximization performs policy improvement consistent with those estimates. A classifier-based data-alignment term bridges synthetic and offline data. Theoretical results show that the maximin objective lower-bounds the standard RL objective, which keeps updates objective-consistent and conservative during learning. Empirically, DAMO delivers competitive or state-of-the-art results on the D4RL and NeoRL benchmarks, mitigates out-of-distribution (OOD) actions and states, and generalizes robustly under limited data coverage and noisy settings. The paper also reports hyperparameter sensitivity analyses.
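
To make the idea of a classifier-based alignment signal concrete, the following is a minimal sketch of the standard density-ratio trick: a binary classifier trained to tell offline samples from synthetic ones yields a logit that, with balanced classes, approximates the log density ratio between the two data sources. This is an illustration under stated assumptions, not DAMO's actual alignment term. The one-dimensional Gaussian data, the helper names `fit_logistic` and `log_density_ratio`, and the learning-rate settings are all hypothetical.

```python
import numpy as np

def fit_logistic(x, y, lr=0.5, steps=2000):
    """Fit logistic regression with plain gradient descent; returns (w, b)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(offline | x)
        grad = p - y                            # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def log_density_ratio(x, w, b):
    """Classifier logit; with balanced classes it approximates
    log p_offline(x) - log p_synthetic(x)."""
    return x @ w + b

# Hypothetical stand-ins for the two data sources (assumption, not from the paper).
rng = np.random.default_rng(0)
offline = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))    # "real" samples
synthetic = rng.normal(loc=1.0, scale=1.0, size=(2000, 1))  # model rollouts

x = np.concatenate([offline, synthetic])
y = np.concatenate([np.ones(len(offline)), np.zeros(len(synthetic))])  # 1 = offline
w, b = fit_logistic(x, y)

# Samples that look more like offline data receive a higher score.
scores = log_density_ratio(np.array([[-1.0], [2.0]]), w, b)
```

For these two unit-variance Gaussians the exact log ratio is 0.5 - x, so the fitted weight should be close to -1 and the bias close to 0.5. A score of this kind could be used to penalize or reweight synthetic samples that drift away from the offline data; how DAMO incorporates its classifier into the maximin objective is specified in the paper itself.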

Abstract

Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. While most prior research has focused on improving the fidelity of synthetic sampling and incorporating off-policy mechanisms, the directly integrated paradigm often fails to ensure consistent policy behavior in biased models and underlying environmental dynamics, which inherently arise from discrepancies between behavior and learning policies. In this paper, we first shift the focus from model reliability to policy discrepancies while optimizing for expected returns, and then self-consistently incorporate synthetic data, deriving a novel actor-critic paradigm, Dual Alignment Maximin Optimization (DAMO). It is a unified framework to ensure both model-environment policy consistency and synthetic and offline data compatibility. The inner minimization performs dual conservative value estimation, aligning policies and trajectories to avoid out-of-distribution states and actions, while the outer maximization ensures that policy improvements remain consistent with inner value estimates. Empirical evaluations demonstrate that DAMO effectively ensures model and policy alignments, achieving competitive performance across diverse benchmark tasks.
