Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning
Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
TL;DR
This paper tackles extrapolation errors in offline RL by proposing Adaptive Neighborhood-Constrained Q Learning (ANQ), which restricts Bellman-target actions to the union of ε-neighborhoods around dataset actions and adapts the radius of each neighborhood per data point using a derived advantage-based rule. The method is implemented via a bilevel optimization framework: an inner maximization over neighborhoods with an auxiliary policy, and an outer expectile regression that approximates the best neighborhood value. Policy extraction then uses weighted regression toward the optimized actions. Theoretical results show that the neighborhood constraint bounds extrapolation errors and distribution shift under mild standardness assumptions, and that it provides a practical approximation to the least restrictive support constraint without explicit behavior policy modeling. Empirically, ANQ achieves state-of-the-art results on the D4RL Gym locomotion and AntMaze benchmarks, demonstrates robustness to noisy and limited data, and has a runtime competitive with other fast offline RL methods.
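To illustrate why expectile regression can stand in for a maximum, here is a minimal NumPy sketch of the generic asymmetric squared loss. It is not the authors' implementation: the function names, the scalar gradient-descent fit, and the sample values are all assumptions made for illustration. As the expectile level tau approaches 1, the fitted value moves from the sample mean toward the largest sample.

```python
import numpy as np

def expectile_loss(diff, tau):
    """Asymmetric squared loss: mean of |tau - 1(diff < 0)| * diff**2."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def fit_expectile(samples, tau, lr=0.1, steps=2000):
    """Fit a scalar v minimizing expectile_loss(samples - v, tau).

    Plain gradient descent on a single scalar (hypothetical helper,
    chosen for clarity rather than efficiency).
    """
    samples = np.asarray(samples, dtype=float)
    v = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - v
        weight = np.where(diff < 0, 1.0 - tau, tau)
        grad = -2.0 * float(np.mean(weight * diff))  # d(loss)/dv
        v -= lr * grad
    return v

# Made-up values standing in for Q-values of actions inside one neighborhood.
values = [0.0, 1.0, 2.0, 3.0, 10.0]
v_mean = fit_expectile(values, tau=0.5)   # tau = 0.5 recovers the mean
v_high = fit_expectile(values, tau=0.99)  # tau near 1 approaches the max
```

With tau = 0.5 the loss is symmetric and the fit is the ordinary mean. Raising tau penalizes underestimation more heavily, so the fit is pulled toward the largest value without ever exceeding it.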
Abstract
Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
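The neighborhood constraint can be pictured with a small NumPy sketch. It rests on several assumptions that the abstract does not state: neighborhoods are taken to be closed L2 balls, the dataset actions passed in are assumed to be those recorded for the state in question, and the per-point radii are made-up numbers, not the output of the paper's adaptation rule. The helper names are hypothetical.

```python
import numpy as np

def in_neighborhood_union(action, dataset_actions, radii):
    """True if `action` is within radii[i] of some dataset_actions[i] (L2 norm)."""
    dists = np.linalg.norm(dataset_actions - action, axis=1)
    return bool(np.any(dists <= radii))

def project_to_neighborhood(action, center, radius):
    """Project `action` onto the closed L2 ball of `radius` around `center`."""
    offset = action - center
    norm = np.linalg.norm(offset)
    if norm <= radius:
        return action  # already inside the neighborhood
    return center + offset * (radius / norm)

# Two dataset actions with different (illustrative) radii: a smaller radius
# means more conservatism around that particular data point.
dataset_actions = np.array([[0.0, 0.0], [1.0, 1.0]])
radii = np.array([0.1, 0.5])

inside = in_neighborhood_union(np.array([1.3, 1.0]), dataset_actions, radii)
outside = in_neighborhood_union(np.array([0.3, 0.0]), dataset_actions, radii)
clipped = project_to_neighborhood(np.array([0.3, 0.0]), dataset_actions[0], radii[0])
```

Because each data point carries its own radius, conservatism can be tuned pointwise: shrinking a radius to zero recovers that dataset action exactly, while enlarging it admits more nearby actions.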
