On the Model-Misspecification in Reinforcement Learning

Yunfan Li; Lin Yang

TL;DR

This work analyzes model misspecification in reinforcement learning with general function approximation and shows that value-based and model-based methods can remain robust when the misspecification is only locally bounded, that is, small on average under a policy-induced distribution. It introduces LBM-UCB, a unified framework that achieves regret $\widetilde{O}(\text{poly}(dH)(\sqrt{K} + K\zeta))$ by constructing confidence sets around the best empirical approximator and operating with average optimism via a virtual data-collection mechanism. The paper provides two concrete instantiations, Robust-LSVI (value-based) and Robust-UCRL-VTR (model-based), and proves regret bounds that scale with the eluder dimension and the local misspecification bound $\zeta$; a meta-algorithm handles the case where $\zeta$ is unknown. A key novelty is the focus on the policy-induced distribution to obtain average-optimism guarantees, which yields an approach for general function classes that needs no prior knowledge of $\zeta$. Overall, the framework extends misspecification-robust RL beyond linear settings, with provable regret bounds for general function classes.

Abstract

The success of reinforcement learning (RL) crucially depends on effective function approximation when dealing with complex ground-truth models. Existing sample-efficient RL algorithms primarily employ three approaches to function approximation: policy-based, value-based, and model-based methods. However, in the face of model misspecification (a disparity between the ground-truth and optimal function approximators), it is shown that policy-based approaches can be robust even when the policy function approximation is under a large locally-bounded misspecification error, with which the function class may exhibit a $\Omega(1)$ approximation error in specific states and actions, but remains small on average within a policy-induced state distribution. Yet it remains an open question whether similar robustness can be achieved with value-based and model-based approaches, especially with general function approximation. To bridge this gap, in this paper we present a unified theoretical framework for addressing model misspecification in RL. We demonstrate that, through meticulous algorithm design and sophisticated analysis, value-based and model-based methods employing general function approximation can achieve robustness under local misspecification error bounds. In particular, they can attain a regret bound of $\widetilde{O}\left(\text{poly}(d H)(\sqrt{K} + K\zeta) \right)$, where $d$ represents the complexity of the function class, $H$ is the episode length, $K$ is the total number of episodes, and $\zeta$ denotes the local bound for misspecification error. Furthermore, we propose an algorithmic framework that can achieve the same order of regret bound without prior knowledge of $\zeta$, thereby enhancing its practical applicability.
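To make the idea of a locally bounded misspecification error concrete, here is a toy sketch. It is not taken from the paper: the number of states, the visitation weights, and the error values are all illustrative assumptions. It contrasts the worst-case approximation error over states with the error averaged under a (hypothetical) policy-induced state distribution.

```python
# Toy illustration (hypothetical numbers, not taken from the paper).
n_states = 1000

# Unnormalized visitation weights: state 0 is almost never reached
# by the policy, all other states are visited equally often.
weights = [1e-3] + [1.0] * (n_states - 1)
total = sum(weights)
visitation = [w / total for w in weights]

# Approximation error per state: large at state 0, small elsewhere.
approx_error = [1.0] + [0.01] * (n_states - 1)

# A uniform (worst-case) bound must cover the single bad state ...
worst_case_error = max(approx_error)

# ... while a local bound only needs the visitation-weighted average.
average_error = sum(p * e for p, e in zip(visitation, approx_error))

print(f"worst-case error: {worst_case_error:.3f}")
print(f"average error under visitation: {average_error:.4f}")
```

In this toy setup a uniform misspecification bound would be of constant order, whereas the average under the visitation distribution stays close to the small per-state error, which is the regime the abstract describes.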

Paper Structure (34 sections, 37 theorems, 213 equations, 6 algorithms)

Introduction
Related work
Misspecified Bandit
RL with Function Approximations
Preliminary
Episodic RL with Finite-Horizon Markov Decision Process
Function Approximation
Value-based Function Approximation
Model-based Function Approximation
Notation
Robust RL Algorithms with General Function Approximation
Generic Framework: LBM-UCB
LBM-UCB for Value-based Algorithm
LBM-UCB for Model-based Algorithm
Theoretical Analysis of Robust RL Algorithms with General Function Approximation
...and 19 more sections

Key Result

Theorem 5.5

Under the paper's assumptions (labels ass:cover and Ass:general-value), for any fixed $\delta \in (0,1)$, with probability at least $1-\delta$, the total regret of the algorithm with known $\zeta$ (label: general known) is at most $\widetilde{O}\left(\sqrt{d_E H^3}\,K\zeta \log(1/\delta) + \sqrt{d_E^2 K H^3} \log(1/\delta) \right)$, where $d_E$ denotes the eluder dimension (Definition 5.1).
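As a rough reading aid, the two terms of this bound can be compared numerically. The sketch below is an assumption-laden simplification, not part of the paper: it drops all constants and logarithmic factors hidden by $\widetilde{O}$, and the function names `regret_terms` and `crossover_episodes` are hypothetical. Setting the two terms equal gives $\sqrt{K}\,\zeta = \sqrt{d_E}$, so at the level of orders the misspecification term overtakes the statistical term once $K$ exceeds roughly $d_E/\zeta^2$.

```python
import math

def regret_terms(d_E, H, K, zeta):
    """Two terms of the Theorem 5.5 bound, with constants and log factors dropped."""
    misspec = math.sqrt(d_E * H**3) * K * zeta   # sqrt(d_E H^3) * K * zeta
    stat = math.sqrt(d_E**2 * K * H**3)          # sqrt(d_E^2 K H^3)
    return misspec, stat

def crossover_episodes(d_E, zeta):
    """Episode count at which the two simplified terms are equal: K = d_E / zeta^2."""
    return d_E / zeta**2

# Illustrative values only (not from the paper).
d_E, H, zeta = 10, 5, 0.01
K_star = crossover_episodes(d_E, zeta)
for K in (K_star / 100, K_star, K_star * 100):
    misspec, stat = regret_terms(d_E, H, K, zeta)
    print(f"K={K:.0f}  misspecification term={misspec:.1f}  statistical term={stat:.1f}")
```

Below the crossover the $\sqrt{K}$ term dominates and the bound behaves as in the well-specified case; above it the bound grows linearly in $K$ at a rate proportional to $\zeta$.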

Theorems & Definitions (46)

Definition 5.1: Eluder dimension
Remark 5.3
Theorem 5.5: Regret bound with known $\zeta$
Remark 5.6
Remark 5.7
Theorem 5.10: Regret bound with known $\zeta$
Remark 5.11
Remark 5.12
Theorem 6.1: Regret bound with unknown $\zeta$
Remark B.2
...and 36 more
