Behavior Preference Regression for Offline Reinforcement Learning

Padmanaba Srinivasan, William Knottenbelt

TL;DR

This paper tackles offline reinforcement learning by reframing policy optimization as a paired-sample regression problem that favors actions with high likelihood under a behavior model while maximizing reward. It derives a reverse KL-constrained objective where the reference policy is tied to a soft-Q estimate and the regression target uses a log-density of an implicit behavior policy modeled as an energy-based model, enabling direct optimization of policy density without explicit partition functions. The proposed Behavior Preference Regression (BPR) achieves state-of-the-art results on D4RL Locomotion and Antmaze and performs well on the image-based V-D4RL suite, with strong on-policy stability in many settings. The method relies on self-play sampling, an SAC-like actor-critic implementation, and lightweight hyperparameter tuning (notably λ ≈ 1). Overall, BPR provides a flexible, scalable framework for aligning offline policies with high-reward behaviors while staying within dataset support, and its insights extend to on-policy value settings and ensemble strategies for further gains.

Abstract

Offline reinforcement learning (RL) methods aim to learn optimal policies with access only to trajectories in a fixed dataset. Policy constraint methods formulate policy learning as an optimization problem that balances maximizing reward with minimizing deviation from the behavior policy. Closed-form solutions to this problem can be derived as weighted behavioral cloning objectives that, in theory, must compute an intractable partition function. Reinforcement learning has gained popularity in language modeling to align models with human preferences; some recent works consider paired completions that are ranked by a preference model, after which the likelihood of the preferred completion is directly increased. We adapt this approach of paired comparison. By reformulating the paired-sample optimization problem, we fit the maximum-mode of the Q function while maximizing behavioral consistency of policy actions. This yields our algorithm, Behavior Preference Regression for offline RL (BPR). We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. BPR demonstrates state-of-the-art performance across all domains. Our on-policy experiments suggest that BPR takes advantage of the stability of on-policy value functions with minimal perceptible performance degradation on Locomotion datasets.
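
For context, the closed-form solution and partition function mentioned in the abstract are commonly written as below. This is a sketch of the standard KL-regularized result from the policy-constraint literature, not a formula quoted from this paper; the temperature $\lambda$, the behavior policy $\pi_{\beta}$, the dataset $\mathcal{D}$, and the normalizer $Z(s)$ are notational assumptions made here for illustration.

$$
\begin{aligned}
\pi^{*} &= \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big] - \lambda \, D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \pi_{\beta}(\cdot \mid s)\big) \\
\pi^{*}(a \mid s) &= \frac{1}{Z(s)} \, \pi_{\beta}(a \mid s) \exp\!\left(\frac{Q(s, a)}{\lambda}\right), \qquad Z(s) = \int \pi_{\beta}(a' \mid s) \exp\!\left(\frac{Q(s, a')}{\lambda}\right) \mathrm{d}a' \\
\theta^{*} &= \arg\max_{\theta} \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\frac{1}{Z(s)} \exp\!\left(\frac{Q(s, a)}{\lambda}\right) \log \pi_{\theta}(a \mid s)\right]
\end{aligned}
$$

The last line is the weighted behavioral cloning objective obtained by projecting $\pi^{*}$ onto a parametric policy $\pi_{\theta}$. Evaluating $Z(s)$ requires integrating over the whole action space for every state, which is why it is generally intractable in continuous control.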

Paper Structure

This paper contains 33 sections, 2 theorems, 12 equations, 4 tables, 1 algorithm.

Key Result

Proposition 1

(Perfect Preference Model) If the preference function $\texttt{P}(s, a_1, a_2)$ is perfect, i.e., $\tilde{Q}^* = Q^* + \pi_{\beta}$ is accurate, then the deterministic policies $\pi_{\beta}$ and $\tilde{\pi}$ satisfy:

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2