Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning
Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári
TL;DR
This work introduces FQI-log, a batch (offline) RL algorithm that replaces the squared loss with the log-loss when regressing the action-value function. This yields small-cost bounds, i.e. guarantees that tighten when the optimal cost under the data-generating distribution, $\bar{v}^\star$, is small. The authors prove what they present as the first such result for an efficient batch RL algorithm: the suboptimality scales with $\sqrt{\bar{v}^\star}$ together with other instance-dependent terms. The analysis has three core steps: an error decomposition that couples a small-cost term with pointwise deviations from the optimum, a contraction argument with respect to the Hellinger distance, and an error-propagation bound based on log-loss concentration. Empirically, FQI-log is more sample-efficient than squared-loss FQI on goal-directed tasks (Mountain Car, Inverted Pendulum) and performs competitively on Atari games against squared-loss baselines and distributional methods, supporting the practical relevance of the log-loss in batch RL. The work points to possible extensions to infinite function classes and to online settings, and more broadly to the benefit of matching the loss to the problem structure when optimal costs are small.
Abstract
We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
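To make the loss switch concrete, the following is a minimal sketch of the two regression losses applied to one batch of Bellman targets. It is not the authors' implementation. It assumes costs are normalized so that cumulative costs and predictions lie in $[0, 1]$, that the log-loss is the binary cross-entropy with soft targets, and that targets are clipped to $[0, 1]$; the function names, the `done` flag, and the toy numbers are hypothetical.

```python
import numpy as np

def bellman_targets(costs, next_q, done):
    """Cost-minimizing targets c + min_a' f_k(s', a').

    Assumption: costs are normalized so cumulative cost lies in [0, 1];
    the clip only guards against numerical overshoot.
    """
    bootstrap = np.where(done, 0.0, next_q.min(axis=1))
    return np.clip(costs + bootstrap, 0.0, 1.0)

def log_loss(pred, target, eps=1e-12):
    """Binary cross-entropy with soft targets in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def squared_loss(pred, target):
    """Squared loss used by standard FQI."""
    return (pred - target) ** 2

# Hypothetical batch of three transitions with two actions each.
costs = np.array([0.0, 0.1, 0.0])
next_q = np.array([[0.2, 0.05], [0.3, 0.4], [0.9, 0.7]])  # f_k(s', .)
done = np.array([False, False, True])

targets = bellman_targets(costs, next_q, done)
preds = np.array([0.05, 0.5, 0.01])  # candidate f(s, a) values

mean_log = log_loss(preds, targets).mean()
mean_sq = squared_loss(preds, targets).mean()
```

In a full fitted Q-iteration loop, the next iterate would be the function in the chosen class that minimizes the mean of one of these losses over the batch. Both losses are minimized pointwise at `pred == target`, so in this sketch they differ only in how deviations are penalized, and the log-loss penalizes errors more heavily when the target is close to 0 or 1.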
