Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji

TL;DR

This paper tackles extrapolation errors in offline RL by proposing Adaptive Neighborhood-Constrained Q Learning (ANQ), which restricts Bellman-target actions to the union of ε-neighborhoods around dataset actions and adapts these radii per data point using a derived advantage-based rule. The method is implemented via a bilevel optimization framework: an inner maximization over neighborhoods with an auxiliary policy, and an outer expectile regression that approximates the best neighborhood value. Policy extraction then uses weighted regression toward the optimized actions. Theoretical results show that the neighborhood constraint bounds extrapolation errors and distribution shift under mild standardness assumptions and provides a practical approximation to the least restrictive support constraint without explicit behavior policy modeling. Empirically, ANQ achieves state-of-the-art results on D4RL Gym locomotion and AntMaze benchmarks and demonstrates robustness to noisy and limited data, with competitive runtime compared to other fast offline RL methods.
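
To illustrate why expectile regression can stand in for a maximum, the sketch below fits a single scalar to a set of sample values under the asymmetric squared loss. It is a minimal illustration and not the authors' implementation: the function names, the expectile level `tau`, the learning rate, and the step count are all assumptions chosen for the example. With `tau = 0.5` the fit recovers the mean, and as `tau` approaches 1 it moves toward the largest sample.

```python
import numpy as np


def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: weight tau where target > pred, else 1 - tau."""
    diff = np.asarray(target, dtype=float) - np.asarray(pred, dtype=float)
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))


def fit_expectile(samples, tau=0.9, lr=0.1, steps=2000):
    """Gradient descent on one scalar to find the tau-expectile of samples.

    Hypothetical helper for illustration; lr and steps are arbitrary choices.
    """
    samples = np.asarray(samples, dtype=float)
    v = float(samples.mean())  # start from the mean (the 0.5-expectile)
    for _ in range(steps):
        diff = samples - v
        weight = np.where(diff > 0, tau, 1.0 - tau)
        grad = -2.0 * float(np.mean(weight * diff))  # d/dv of the loss
        v -= lr * grad
    return v


if __name__ == "__main__":
    values = [0.0, 1.0, 2.0, 3.0]
    print(fit_expectile(values, tau=0.5))   # close to the mean
    print(fit_expectile(values, tau=0.99))  # close to the maximum
```

In this toy setting the samples play the role of values attained inside a neighborhood, so a high expectile level gives a soft estimate of the best value without an explicit max operator.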

Abstract

Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
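
The neighborhood constraint described in the abstract can be pictured as a membership test: a candidate action is admissible if it lies within the radius of at least one dataset action. The sketch below is a simplified illustration under stated assumptions, not the paper's code. It uses the Euclidean norm as the distance, takes the radii as given inputs rather than computing them from data quality, and the function name `in_neighborhood_union` is hypothetical.

```python
import numpy as np


def in_neighborhood_union(action, dataset_actions, radii):
    """Return True if `action` lies in the union of balls around dataset actions.

    action:          shape (d,) candidate action
    dataset_actions: shape (n, d) actions observed in the dataset
    radii:           scalar or shape (n,) per-point neighborhood radii
    Assumes Euclidean distance; the paper's exact metric may differ.
    """
    action = np.asarray(action, dtype=float)
    dataset_actions = np.asarray(dataset_actions, dtype=float)
    # Broadcast a single shared radius to one radius per dataset action.
    radii = np.broadcast_to(np.asarray(radii, dtype=float),
                            (len(dataset_actions),))
    dists = np.linalg.norm(dataset_actions - action, axis=1)
    return bool(np.any(dists <= radii))


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0]]
    per_point_radii = [0.1, 0.5]  # tighter around the first action
    print(in_neighborhood_union([0.05, 0.0], data, per_point_radii))
    print(in_neighborhood_union([1.3, 1.0], data, per_point_radii))
    print(in_neighborhood_union([0.3, 0.0], data, per_point_radii))
```

Per-point radii are what allow pointwise conservatism: shrinking one radius toward zero approaches a sample constraint at that data point, while enlarging it allows more deviation from the dataset action.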

Paper Structure

This paper contains 46 sections, 9 theorems, 51 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

If any of the conditions $\mathrm{D_{KL}}(\pi\|\pi_\beta) \leq 2\epsilon$, $\mathrm{D_{KL}}(\pi_\beta\|\pi) \leq 2\epsilon$, or $\mathrm{D_{TV}}(\pi,\pi_\beta) \leq \sqrt{\epsilon}$ holds, then the policy performance $\eta$ is bounded as follows:

Figures (8)

  • Figure 1: (a) Evaluation on noisy datasets over five random seeds. (b) Evaluation of ANQ on noisy datasets with varying inverse temperature $\alpha$ that determines the adaptiveness of neighborhood radius.
  • Figure 2: (a) Evaluation on reduced datasets over five random seeds. (b) Evaluation of ANQ on reduced datasets with varying Lagrange multiplier $\lambda$ that controls the overall radius of neighborhoods.
  • Figure 3: Performance and Q values of ANQ with varying Lagrange multiplier $\lambda$ over five random seeds. The crosses $\times$ mean that the value functions diverge in some seeds. As $\lambda$ decreases, ANQ enables larger overall neighborhood radii, resulting in higher and probably divergent learned Q values. A moderate $\lambda$ (neighborhood constraint) is crucial for achieving superior performance.
  • Figure 4: Performance and Q values of ANQ with varying inverse temperature $\alpha$ over five random seeds. An appropriately large $\alpha$ (adaptive neighborhoods) yields enhanced performance.
  • Figure 5: Runtime of algorithms on halfcheetah-medium-replay-v2 on a GeForce RTX 3090.
  • ...and 3 more figures

Theorems & Definitions (19)

  • Definition 1: Density constraint
  • Lemma 1: Performance bound under density constraints
  • Definition 2: Support constraint
  • Definition 3: Sample constraint
  • Definition 4: Neighborhood constraint
  • Theorem 1: Support approximation via neighborhoods
  • Lemma 2: Extrapolation behavior
  • Proposition 1: Distribution shift
  • Definition 5: Adaptive neighborhood constraint
  • Lemma 3
  • ...and 9 more