Dual Alignment Maximin Optimization for Offline Model-based RL

Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang

TL;DR

This work tackles distribution mismatch in offline model-based RL by addressing policy discrepancies that arise when learned dynamics diverge from real environments. It introduces Dual Alignment Maximin Optimization (DAMO), a unified actor-critic framework that performs inner minimization for dual conservative value estimation and outer maximization for consistent policy improvement, augmented by a classifier-based data-alignment term to bridge synthetic and offline data. Theoretical results show the maximin objective lower-bounds the standard RL objective, ensuring objective-consistent updates and conservatism during learning. Empirically, DAMO delivers competitive or state-of-the-art results on D4RL and NeoRL benchmarks, demonstrates effective mitigation of OOD actions and states, and exhibits robust generalization under limited data coverage and noisy settings, with realistic hyperparameter sensitivity analyses.

Abstract

Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. While most prior research has focused on improving the fidelity of synthetic sampling and incorporating off-policy mechanisms, the directly integrated paradigm often fails to ensure consistent policy behavior in biased models and underlying environmental dynamics, which inherently arise from discrepancies between behavior and learning policies. In this paper, we first shift the focus from model reliability to policy discrepancies while optimizing for expected returns, and then self-consistently incorporate synthetic data, deriving a novel actor-critic paradigm, Dual Alignment Maximin Optimization (DAMO). It is a unified framework to ensure both model-environment policy consistency and synthetic and offline data compatibility. The inner minimization performs dual conservative value estimation, aligning policies and trajectories to avoid out-of-distribution states and actions, while the outer maximization ensures that policy improvements remain consistent with inner value estimates. Empirical evaluations demonstrate that DAMO effectively ensures model and policy alignments, achieving competitive performance across diverse benchmark tasks.
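
The TL;DR mentions a classifier-based data-alignment term that bridges synthetic and offline data. As a minimal sketch of the general idea only, and not the authors' formulation, the snippet below uses the standard density-ratio trick: a binary classifier trained to separate offline samples from synthetic samples has a logit that estimates $\log\big(p_{\text{offline}}(x) / p_{\text{synthetic}}(x)\big)$ when the two classes are balanced. The function names, the 1-D Gaussian toy data, and the plain gradient-descent fit are all illustrative assumptions.

```python
import numpy as np

def add_bias(x):
    """Append a constant column so the last weight acts as the bias."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def fit_logistic(x, y, lr=0.5, steps=2000):
    """Fit logistic regression by full-batch gradient descent (hypothetical helper)."""
    x_aug = add_bias(x)
    w = np.zeros(x_aug.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_aug @ w)))   # predicted P(offline | x)
        w -= lr * x_aug.T @ (p - y) / len(y)     # gradient of mean log-loss
    return w

def log_density_ratio(x, w):
    """Classifier logit, an estimate of log(p_offline(x) / p_synthetic(x))."""
    return add_bias(x) @ w

# Toy stand-ins for the two data sources (assumed, not from the paper).
rng = np.random.default_rng(0)
offline = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))
synthetic = rng.normal(loc=1.0, scale=1.0, size=(2000, 1))

x = np.vstack([offline, synthetic])
y = np.concatenate([np.ones(len(offline)), np.zeros(len(synthetic))])
w = fit_logistic(x, y)

# For these two Gaussians the exact log ratio is 0.5 - x,
# so the fitted weights should be close to [-1.0, 0.5].
query = np.array([[-1.0], [2.0]])
ratios = log_density_ratio(query, w)
```

A positive value marks a sample that looks more like offline data, and a negative value marks one that looks more like model-generated data. How such a signal enters the critic or actor loss in DAMO is specified in the paper itself, not here.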

Paper Structure

This paper contains 42 sections, 24 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Distributions of transitions $(s, a, s^{\prime})$ across three datasets (Offline Data, Real Data, and Synthetic Data) for the hopper-medium-expert task, comparing MOPO and MOBILE with our proposed DAMO framework. The Real Data is collected by executing, in the real environment, the policy trained with each offline method. While MOPO and MOBILE align synthetic and offline data well (indicated by a high overlap), they fail to mitigate OOD states in the real environment (circled by an ellipse). In contrast, DAMO effectively eliminates the OOD region.
  • Figure 2: Intuitive examples of the OOD issue. The behavior policy $\pi_{\beta}$ primarily favors the top road ($a_{1}$), rarely takes the bottom road ($a_{3}$), and never selects the middle road ($a_{2}$). Thus, the dynamics model has no representation of $a_{2}$ and treats $a_{3}$ as an OOD action. The policy $\pi$, trained on both offline and synthetic data, avoids $a_{3}$ when deployed in the real environment but struggles to evaluate $a_{2}$ because it is absent from $M$, so it occasionally enters OOD states. In DAMO, $\pi$ is constrained by $\pi_{\beta}$, guiding the agent to consistently select $a_{1}$. At the bottom, we show that while data alignment ensures a similar distribution between offline and synthetic data, it fails to guarantee behavior consistency between $M$ and $T$.
  • Figure 3: Training process of DAMO and its two ablations: without the behavior alignment term (w/o ir) and without the data alignment term (w/o er). (a) The data alignment term is the primary factor that influences policy improvement, while the absence of the behavior alignment term results in performance degradation in the final stages. (b) The lack of the data alignment term leads to overestimation of the value.
  • Figure 4: (a)-(c) illustrate the results of DAMO without the data alignment term (w/o er), DAMO without the behavior alignment term (w/o ir), and standard DAMO. Left panel: state-action distributions $(s, a)$ (offline vs. synthetic). Right panel: state distributions $s$ (real vs. synthetic).
  • Figure 5: (a) The tuning experiment of $\alpha$ was performed on the hopper-medium-expert dataset for validation. (b) The fixing experiment of $\alpha$ was performed on the halfcheetah-medium-expert dataset for validation.
  • ...and 1 more figures