Budgeting Counterfactual for Offline RL

Yao Liu, Pratik Chaudhari, Rasool Fakoor

TL;DR

Offline RL suffers from extrapolation errors due to out-of-distribution actions as the planning horizon grows. BCOL introduces a budget on counterfactual decisions and a dynamic-programming–based counterfactual-budgeting Bellman operator, $\mathcal{T}_{\text{CB}}$, whose fixed point is the optimal $Q$-function $Q^{\star}$ under the budget. The approach combines a practical offline actor-critic algorithm with a monotonicity penalty and demonstrates strong empirical performance on the D4RL benchmark (MuJoCo and AntMaze), while ablations highlight the importance of budgeting during both training and testing. Overall, BCOL provides a principled, tunable mechanism to cap extrapolation risk while enabling targeted improvements, with significant practical impact for safe offline reinforcement learning.

Abstract

The main challenge of offline reinforcement learning, where data is limited, arises from a sequence of counterfactual reasoning dilemmas within the realm of potential actions: What if we were to choose a different course of action? These circumstances frequently give rise to extrapolation errors, which tend to accumulate exponentially with the problem horizon. Hence, it becomes crucial to acknowledge that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy makes in order to control the extrapolation. Contrary to existing approaches that use regularization on either the policy or value function, we propose an approach to explicitly bound the number of out-of-distribution actions during training. Specifically, our method utilizes dynamic programming to decide where to extrapolate and where not to, with an upper bound on the number of decisions that differ from the behavior policy. It balances the potential for improvement from taking out-of-distribution actions against the risk of making errors due to extrapolation. Theoretically, we justify our method by the constrained optimality of the fixed point solution to our $Q$ updating rules. Empirically, we show that the overall performance of our method is better than the state-of-the-art offline RL methods on tasks in the widely-used D4RL benchmarks.

Paper Structure (16 sections, 1 theorem, 22 equations, 4 figures, 8 tables)

This paper contains 16 sections, 1 theorem, 22 equations, 4 figures, 8 tables.

Key Result

Theorem 2

There exists a unique fixed point of $\mathcal{T}_{\text{CB}}$, and it is $Q^{\star}$, the optimal $Q$-function under the budget. Here $Q^{\star}(s, b, a)$ is defined conditional on $b_1 = b$ rather than $b_0 = b$, because the action $a$ is already given and the future (optimal) value should be independent of which distribution $a$ is drawn from.
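To make the fixed-point statement in Theorem 2 concrete, the following is a minimal tabular sketch of a budgeted Bellman backup. It is not the paper's exact definition of $\mathcal{T}_{\text{CB}}$: it assumes a small MDP with known transitions `P`, rewards `R`, and behavior policy `mu`, and assumes that following `mu` leaves the budget unchanged while any other choice spends one unit. The names `cb_backup` and `solve` are hypothetical. `Q[b][s][a]` is indexed by the budget left after taking `a`, in line with the $b_1 = b$ convention noted in the theorem.

```python
import random

def cb_backup(Q, P, R, mu, gamma):
    """One budgeted backup (illustrative assumption, not the paper's operator).

    Q[b][s][a]: value of taking a in s with b counterfactual decisions left afterwards.
    """
    B, nS, nA = len(Q) - 1, len(R), len(R[0])
    new_Q = []
    for b in range(B + 1):
        V = []
        for s2 in range(nS):
            # Option 1: follow the behavior policy, budget stays at b.
            follow = sum(mu[s2][a2] * Q[b][s2][a2] for a2 in range(nA))
            # Option 2: deviate greedily, which spends one unit of budget.
            deviate = max(Q[b - 1][s2]) if b > 0 else float("-inf")
            V.append(max(follow, deviate))
        new_Q.append([[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(nS))
                       for a in range(nA)] for s in range(nS)])
    return new_Q

def solve(P, R, mu, B, gamma=0.9, iters=400):
    """Iterate the backup from an all-zero table until it (numerically) stops changing."""
    nS, nA = len(R), len(R[0])
    Q = [[[0.0] * nA for _ in range(nS)] for _ in range(B + 1)]
    for _ in range(iters):
        Q = cb_backup(Q, P, R, mu, gamma)
    return Q

def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]

# Random toy MDP (purely illustrative data, not from the paper).
rng = random.Random(0)
nS, nA = 4, 3
P = [[normalize([rng.random() for _ in range(nS)]) for _ in range(nA)] for _ in range(nS)]
R = [[rng.random() for _ in range(nA)] for _ in range(nS)]
mu = [normalize([rng.random() for _ in range(nA)]) for _ in range(nS)]

Q = solve(P, R, mu, B=3)
for b in range(4):
    print(b, [round(max(Q[b][s]), 3) for s in range(nS)])
```

In this sketch the backup is a $\gamma$-contraction, so repeated application converges to a single table regardless of the starting point. With `b = 0` the table reduces to evaluating the behavior policy, and the values are non-decreasing in `b`, which is the kind of monotonicity in the budget that the TL;DR's penalty term refers to.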

Figures (4)

  • Figure 1: Grid world example
  • Figure 2: Total normalized score with different values of $B$ and $\omega$ in BCOL. The left two plots show MuJoCo average scores and the right two plots show AntMaze average scores.
  • Figure 3: Percent difference in performance of different budgeting methods compared with the full BCOL algorithm (hc = HalfCheetah, hop = Hopper, w = Walker2d, am = AntMaze). The top row shows SAC-based experiments and the bottom row shows TD3-based experiments. TD3 plots do not include AntMaze-large tasks since the performance of BCOL on them is zero. "No budgeting" stands for offline SAC/TD3 without the budgeting constraints (equivalent to $B \to \infty$). "Budgeting without planning" stands for randomly selecting $B$ steps to follow $\pi$ and following $\hat{\mu}$ for the rest during the test, where $\pi$ is learned by offline SAC/TD3. "Budgeting without test-time planning" stands for randomly selecting $B$ steps (uniformly within the max horizon) to follow $\pi$ and following $\hat{\mu}$ for the rest during the test, where $\pi$ is learned by the BCOL algorithm. In all settings, $B$ is the same value as selected by BCOL.
  • Figure 4: Learning curves for CDC, BCOL (SAC), TD3+BC, and BCOL (TD3) in D4RL tasks.
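The random-selection baselines described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption only: it assumes the $B$ steps are drawn uniformly without replacement from a fixed horizon, and the function names `random_budget_schedule` and `choose_actions`, as well as the stand-in policies, are hypothetical.

```python
import random

def random_budget_schedule(horizon, B, rng):
    """Mark B time steps, drawn uniformly within the horizon, as 'follow pi'."""
    chosen = set(rng.sample(range(horizon), min(B, horizon)))
    return [t in chosen for t in range(horizon)]

def choose_actions(states, pi, mu_hat, schedule):
    """Use pi on the scheduled steps and the estimated behavior policy elsewhere."""
    return [pi(s) if use_pi else mu_hat(s) for s, use_pi in zip(states, schedule)]

# Stand-in policies that only report which policy acted (illustrative assumption).
pi = lambda s: "pi"
mu_hat = lambda s: "mu_hat"

schedule = random_budget_schedule(horizon=20, B=5, rng=random.Random(0))
actions = choose_actions(list(range(20)), pi, mu_hat, schedule)
print(actions)
```

Unlike the planned variant, the schedule here ignores the state, so the budget may be spent at steps where deviating from $\hat{\mu}$ brings little benefit.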

Theorems & Definitions (4)

  • Definition 1: Counterfactual-Budgeting Bellman Operator
  • Theorem 2
  • Definition 3: Approximate Counterfactual-Budgeting Bellman Operator
  • proof