Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning

Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári

TL;DR

This work introduces FQI-log, a batch (offline) RL algorithm that replaces squared loss with log-loss when regressing the action-value function. The switch yields small-cost bounds: guarantees that tighten when the optimal cost $\bar{v}^\star$ is small. The authors prove what they describe as the first such result for an efficient batch RL algorithm, showing that the suboptimality scales with $\sqrt{\bar{v}^\star}$ and other instance-dependent terms. The proof has three core steps: an error decomposition that couples a small-cost term with pointwise deviations from the optimum, a contraction argument with respect to the Hellinger distance, and an error-propagation bound based on log-loss concentration. Empirically, FQI-log is more sample-efficient than squared-loss FQI on goal-directed tasks (Mountain Car, Inverted Pendulum) and performs competitively on Atari games against squared-loss baselines and distributional methods, which supports the practical relevance of log-loss in batch RL. The authors point to possible extensions to infinite function classes and online settings, and to the broader benefit of matching the loss to the problem structure when optimal costs are small.
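For readers unfamiliar with the two quantities named above, the following is a sketch of how they are commonly written when predictions and targets in $[0,1]$ are treated as Bernoulli means. The excerpt here does not state these definitions, so the symbols $\ell_{\log}$ and $h^2$ and the normalization by $\tfrac12$ are assumptions, and the inequality at the end is a standard fact rather than a quotation of the paper's lemma.

```latex
% Log-loss of a prediction f \in (0,1) against a target y \in [0,1]
\ell_{\log}(f; y) = y \log\frac{1}{f} + (1-y)\log\frac{1}{1-f}

% Squared Hellinger distance between Bernoulli means p, q \in [0,1]
h^2(p, q) = \tfrac12\bigl(\sqrt{p}-\sqrt{q}\bigr)^2
          + \tfrac12\bigl(\sqrt{1-p}-\sqrt{1-q}\bigr)^2

% Why a Hellinger bound gives small-cost behaviour:
(p-q)^2 = \bigl(\sqrt{p}-\sqrt{q}\bigr)^2\bigl(\sqrt{p}+\sqrt{q}\bigr)^2
        \le 2\,h^2(p,q)\cdot 2\,(p+q)
        = 4\,(p+q)\,h^2(p,q)
```

The last line says that $|p-q|$ is at most $2\sqrt{(p+q)\,h^2(p,q)}$, so when the values themselves are small, a given Hellinger error translates into a proportionally smaller error in value. This is the kind of mechanism that can produce a $\sqrt{\bar{v}^\star}$ dependence.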

Abstract

We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
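The procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the log-loss is binary cross-entropy with targets in $[0,1]$, uses a tabular sigmoid-parameterized Q-function fitted by plain gradient descent, and runs on a hypothetical three-state toy MDP in which the optimal policy reaches an absorbing goal at zero cost. The names `fqi_log`, `log_loss`, and `data` are illustrative.

```python
import numpy as np

def log_loss(pred, target, eps=1e-12):
    """Binary cross-entropy with a soft target in [0, 1] (assumed form of the log-loss)."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fqi_log(data, n_states, n_actions, gamma, iters=20, gd_steps=500, lr=0.5):
    """Fitted Q-iteration where each regression minimizes log-loss.

    data: list of (state, action, cost, next_state) tuples.
    Costs are assumed scaled so that all action values lie in [0, 1].
    """
    s, a, c, s_next = (np.asarray(x) for x in zip(*data))
    q = np.full((n_states, n_actions), 0.5)  # initial guess f_0
    for _ in range(iters):
        # Bellman targets for a cost-minimizing agent: c + gamma * min_a' f_k(s', a')
        targets = c + gamma * q[s_next].min(axis=1)
        logits = np.zeros((n_states, n_actions))
        for _ in range(gd_steps):
            pred = sigmoid(logits[s, a])
            grad = np.zeros_like(logits)
            # d log_loss / d logit = pred - target for a sigmoid output
            np.add.at(grad, (s, a), pred - targets)
            logits -= lr * grad
        q = sigmoid(logits)  # f_{k+1}
    return q

# Hypothetical toy MDP: state 2 is an absorbing, cost-free goal.
# Action 0 moves toward the goal at zero cost; action 1 pays 0.25 and resets to state 0.
data = [
    (0, 0, 0.0, 1), (0, 1, 0.25, 0),
    (1, 0, 0.0, 2), (1, 1, 0.25, 0),
    (2, 0, 0.0, 2), (2, 1, 0.0, 2),
]
q_hat = fqi_log(data, n_states=3, n_actions=2, gamma=0.5)
greedy = q_hat.argmin(axis=1)  # cost-minimizing action per state
print(np.round(q_hat, 2), greedy)
```

In this tabular setting the per-cell minimizer of log-loss is the mean of the targets, the same as for squared loss, so the sketch only shows the mechanics of the update. Any difference between the two losses would have to come from a restricted function class and finite noisy data, which this toy example does not model.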

Paper Structure

This paper contains 29 sections, 22 theorems, 84 equations, 4 figures, 1 algorithm.

Key Result

Theorem 5.1

Given a dataset $D_n = \{(S_i,A_i,C_i,S_{i}')\}_{i=1}^n$ with $n\in\mathbb{N}$ and a finite function class $\mathcal{F} \subseteq [0,1]^{\mathcal{S} \times \mathcal{A}}$ that satisfy the data, concentrability, realizability, and completeness assumptions, it holds with probability $1-\delta$ that the suboptima... where $N = \log(|\mathcal{F}|/\delta)$ and $C$ is defined in the concentrability assumption.

Figures (4)

  • Figure 1: The value of the policy learned by FQI as a function of the size of the batch dataset. The results are averaged over $90$ independently collected datasets. The left, middle, and right panels are generated using batch data that contain only $1$, $5$, and $30$ successful trajectories, respectively. The standard error of the mean is reported via the shaded region.
  • Figure 2: The portion of the time that a policy learned by FQI was able to balance the pendulum for $3000$ steps. Results are averaged over $90$ independently collected datasets and each learned policy is tested on $1000$ initializations. The standard error of the mean is shaded.
  • Figure 3: Learning curves on Asterix and Seaquest. The results are averaged over $5$ datasets. The shaded regions represent one standard error of the mean. One epoch contains 100k updates.
  • Figure 4: Learning curves on Asterix and Seaquest. The results are averaged over $5$ datasets. The shaded regions represent one standard error. One epoch contains 100k updates.

Theorems & Definitions (43)

  • Definition 3.1: Admissible distribution
  • Theorem 5.1
  • Lemma 5.2
  • Proposition 5.3
  • Lemma 1.1
  • proof
  • Corollary 1.2
  • proof: Proof of Corollary (cor:hellinger-triangular-scalar)
  • Theorem 1.3
  • proof
  • ...and 33 more