Stability and Generalization for Bellman Residuals

Enoch H. Kang, Kyoungseok Jang

TL;DR

This work addresses the statistical generalization of Bellman residual minimization (BRM) in offline reinforcement learning and offline inverse reinforcement learning. It leverages the Polyak–Łojasiewicz structure of the BRM minimax reformulation and introduces a Lyapunov potential to couple SGDA trajectories on neighboring datasets, achieving an on-average $O(1/n)$ stability bound. This stability transfers to an $O(1/n)$ generalization (and excess MSBE) bound for BRM without variance reduction, extra regularization, or independence assumptions on minibatch sampling, and it applies to standard neural-network parameterizations with minibatch SGD. The results close the statistical gap for BRM and imply improved sample-efficiency guarantees for offline BRM-based RL/IRL methods, with concrete, constructively derived constants. Overall, the paper provides a rigorous, broadly applicable stability-based analysis that strengthens BRM as a principled offline learning objective.

Abstract

Offline reinforcement learning and offline inverse reinforcement learning aim to recover near-optimal value functions or reward models from a fixed batch of logged trajectories, yet current practice still struggles to enforce Bellman consistency. Bellman residual minimization (BRM) has emerged as an attractive remedy, since a globally convergent method for BRM based on stochastic gradient descent-ascent (SGDA) was recently discovered. However, its statistical behavior in the offline setting remains largely unexplored. In this paper, we close this statistical gap. Our analysis introduces a single Lyapunov potential that couples SGDA runs on neighbouring datasets and yields an $O(1/n)$ on-average argument-stability bound, doubling the best known sample-complexity exponent for convex-concave saddle problems. The same stability constant translates into an $O(1/n)$ excess risk bound for BRM, without variance reduction, extra regularization, or restrictive independence assumptions on minibatch sampling. The results hold for standard neural-network parameterizations and minibatch SGD.
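To make the notion of on-average argument stability concrete, the following is a minimal sketch that runs minibatch SGDA on a dataset and on each of its neighbours (one sample replaced), sharing the same minibatch indices, and reports the mean distance between the final iterates. The toy saddle objective $f(x, y; z) = \tfrac{1}{2}x^2 + (x - z)y - \tfrac{1}{2}y^2$, the function names `sgda` and `on_average_gap`, and all hyperparameters are assumptions chosen for illustration; this is not the BRM objective or the paper's algorithm, and it illustrates the quantity being bounded rather than the proof technique.

```python
import numpy as np

def sgda(datasets, batches, eta=0.1):
    """Minibatch SGDA on the toy objective f(x, y; z) = 0.5*x^2 + (x - z)*y - 0.5*y^2.

    `datasets` has shape (m, n): m datasets of n scalar samples, run in parallel.
    `batches` has shape (steps, batch): minibatch indices shared by all runs.
    Returns the final iterates (x, y) with shape (m, 2).
    """
    x = np.zeros(datasets.shape[0])
    y = np.zeros(datasets.shape[0])
    for idx in batches:
        z = datasets[:, idx].mean(axis=1)    # minibatch mean (f is linear in z)
        gx = x + y                           # df/dx
        gy = (x - z) - y                     # df/dy
        x, y = x - eta * gx, y + eta * gy    # descent in x, ascent in y
    return np.stack([x, y], axis=1)

def on_average_gap(n, steps=200, batch=10, eta=0.1, seed=0):
    """Mean distance between SGDA outputs on a dataset and its n neighbours."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=n)
    fresh = rng.normal(size=n)               # replacement sample for each index
    # Row i of `neighbors` is the dataset with sample i replaced by fresh[i].
    neighbors = np.tile(data, (n, 1))
    neighbors[np.arange(n), np.arange(n)] = fresh
    # The same minibatch indices are used for every run (coupled trajectories).
    batches = rng.integers(0, n, size=(steps, batch))
    w = sgda(data[None, :], batches, eta)    # shape (1, 2)
    w_nb = sgda(neighbors, batches, eta)     # shape (n, 2)
    return np.linalg.norm(w_nb - w, axis=1).mean()

if __name__ == "__main__":
    for n in (100, 1000):
        print(n, on_average_gap(n))
```

On this toy problem the averaged gap should shrink as $n$ grows, since a replaced sample only enters a minibatch with probability proportional to $1/n$; the rate observed here says nothing about the constants in the paper's bound.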

Paper Structure

This paper contains 19 sections, 18 theorems, 164 equations, 1 table, 1 algorithm.

Key Result

Lemma 1

For any $Q \in \mathcal{Q}$ and any state-action pair $(s,a)$, the expectation of the sampled Bellman operator over the next state $s'$ recovers the original Bellman operator. Consequently, the expected TD error equals the Bellman error.
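In symbols, under assumed notation that may differ from the paper's, the two claims read $\mathbb{E}_{s' \sim P(\cdot \mid s,a)}[\hat{\mathcal{T}}Q(s,a,s')] = \mathcal{T}Q(s,a)$ and $\mathbb{E}_{s' \sim P(\cdot \mid s,a)}[\hat{\mathcal{T}}Q(s,a,s') - Q(s,a)] = \mathcal{T}Q(s,a) - Q(s,a)$. The sketch below checks both identities numerically on a small random tabular MDP. The hard-max backup $\hat{\mathcal{T}}Q(s,a,s') = r(s,a) + \gamma \max_{a'} Q(s',a')$, the sizes `S` and `A`, and the discount `gamma` are assumptions for illustration; the paper's operator may use a different value functional (for example a soft maximum), for which the same argument applies.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9                       # hypothetical sizes and discount

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)             # P[s, a, :] is a distribution over s'
r = rng.normal(size=(S, A))                   # reward table r(s, a)
Q = rng.normal(size=(S, A))                   # arbitrary Q-function

V = Q.max(axis=1)                             # V_Q(s') = max_a' Q(s', a') (assumed backup)

# Sampled Bellman operator for every triple (s, a, s'); shape (S, A, S).
T_hat = r[:, :, None] + gamma * V[None, None, :]
# Bellman operator: reward plus discounted expected next-state value; shape (S, A).
T = r + gamma * (P @ V)

# Expectation over s' of the sampled operator and of the TD error.
expected_T_hat = (P * T_hat).sum(axis=2)
td_error = T_hat - Q[:, :, None]
expected_td = (P * td_error).sum(axis=2)
bellman_error = T - Q

print(np.allclose(expected_T_hat, T), np.allclose(expected_td, bellman_error))
```

Both identities follow from linearity of expectation because $r(s,a)$ and $Q(s,a)$ do not depend on $s'$. The same is not true of the squared TD error, whose expectation picks up an extra variance term.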

Theorems & Definitions (36)

  • Definition 1: Bellman Error (Bellman Residual)
  • Definition 2: Temporal-Difference Error
  • Lemma 1: Relationship between Bellman and TD Errors
  • Lemma 2: Global convergence of minibatch SGDA in parameter space (Yang et al., 2020; Kang et al., 2025)
  • Definition 3: On-average algorithmic stability
  • Theorem 3: On-average argument stability of SGDA without i.i.d. sampling
  • Corollary 4: Explicit bound under a harmonic stepsize schedule
  • Definition 4: Primal Risk
  • Definition 5: Weak primal–dual risk
  • Lemma 5: Theorem 5 of Wang et al. (2022)
  • ...and 26 more