Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL

Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang

TL;DR

This work tackles offline RL’s extrapolation risk by replacing a fixed global regularization with state-adaptive coefficients that vary with the reliability of Bellman updates. It unifies value-regularization (CQL) and explicit policy constraint approaches through a learnable state-dependent regulator and introduces distribution-aware thresholds and selective regularization on high-quality actions, with extensions to deterministic policies and efficient offline-to-online tuning. Empirically, the method yields substantial improvements over CQL and TD3+BC across D4RL tasks and enables effective online fine-tuning with minimal offline data reliance via linear annealing of the coefficients. Overall, the approach offers a scalable, task- and data-aware mechanism to harness Bellman updates more effectively in offline and offline-to-online reinforcement learning.

Abstract

Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.

Paper Structure

This paper contains 35 sections, 1 theorem, 25 equations, 4 figures, 6 tables, 1 algorithm.

Key Result

Proposition 3.1

With the policy $\pi$ modeled as a Boltzmann distribution, e.g. $\pi(a|s) \propto \exp(Q(s,a))$, the regularization term of the CQL objective is equivalent to the negative log-likelihood of $\pi$ at the dataset actions.
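
The equivalence can be checked numerically in a small discrete setting. The sketch below is an illustration, not the authors' code. It assumes a finite action set, unit temperature, and the standard CQL regularizer of the form $\log\sum_{a'}\exp Q(s,a') - Q(s,a)$ averaged over dataset pairs $(s,a)$, with the regularization coefficient left out. Under these assumptions $-\log\pi(a|s) = \log\sum_{a'}\exp Q(s,a') - Q(s,a)$, so the two quantities should agree. The function names `cql_regularizer` and `boltzmann_nll` are hypothetical.

```python
import numpy as np


def cql_regularizer(q_values, data_actions):
    """Mean over states of logsumexp_a Q(s, a) minus Q at the dataset action."""
    # Subtract the per-state maximum before exponentiating for numerical stability.
    m = q_values.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(q_values - m).sum(axis=1))
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float(np.mean(lse - q_data))


def boltzmann_nll(q_values, data_actions):
    """Mean negative log-likelihood of dataset actions under pi(a|s) = softmax(Q(s, .))."""
    m = q_values.max(axis=1, keepdims=True)
    shifted = q_values - m
    log_pi = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_pi[np.arange(len(data_actions)), data_actions]))


# Toy data (assumed for illustration): 5 states, 4 discrete actions.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))          # Q(s, a) table
a = rng.integers(0, 4, size=5)       # one dataset action per state

reg = cql_regularizer(q, a)
nll = boltzmann_nll(q, a)
print(np.isclose(reg, nll))
```

With continuous actions the sum becomes an integral that CQL approximates by sampling, so the check above only covers the exact discrete case.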

Figures (4)

  • Figure 1: Comparison of Uniform vs. Selective Regularization. The values are evaluated by the policies trained with different regularization on all data in the dataset.
  • Figure 2: t-SNE visualization of different sub-dataset selection methods.
  • Figure 3: Offline performance comparisons of different types of the coefficient used for the regularization.
  • Figure 4: Offline performance comparisons of different selection methods for the sub-dataset.

Theorems & Definitions (1)

  • Proposition 3.1