On the Model-Misspecification in Reinforcement Learning
Yunfan Li, Lin Yang
TL;DR
This work analyzes model misspecification in reinforcement learning with general function approximation, showing that value-based and model-based methods can be robust to locally bounded misspecification. It introduces LBM-UCB, a unified framework that achieves regret $\widetilde{O}(\text{poly}(dH)(\sqrt{K} + K\zeta))$ by constructing confidence sets around the best empirical approximator and operating with average optimism via a virtual data-collection mechanism. The paper provides two concrete instantiations (Robust-LSVI for value-based methods and Robust-UCRL-VTR for model-based methods) and proves regret bounds that scale with the eluder dimension and the local misspecification $\zeta$, along with a meta-algorithm that handles unknown $\zeta$. A key novelty is the focus on the policy-induced distribution, which yields average-optimism guarantees and enables a parameter-free, practical approach for general function classes. Overall, the framework extends robust RL beyond linear settings and provides provable guarantees for applications with misspecified models.
Abstract
The success of reinforcement learning (RL) crucially depends on effective function approximation when dealing with complex ground-truth models. Existing sample-efficient RL algorithms primarily employ three approaches to function approximation: policy-based, value-based, and model-based methods. In the face of model misspecification (a disparity between the ground truth and the optimal function approximator), it has been shown that policy-based approaches can be robust even when the policy function approximation is subject to a large locally bounded misspecification error, under which the function class may exhibit an $\Omega(1)$ approximation error in specific states and actions but the error remains small on average under a policy-induced state distribution. Yet it remains an open question whether similar robustness can be achieved with value-based and model-based approaches, especially with general function approximation. To bridge this gap, we present a unified theoretical framework for addressing model misspecification in RL. We demonstrate that, with careful algorithm design and analysis, value-based and model-based methods employing general function approximation can achieve robustness under local misspecification error bounds. In particular, they can attain a regret bound of $\widetilde{O}\left(\text{poly}(dH)(\sqrt{K} + K\zeta)\right)$, where $d$ represents the complexity of the function class, $H$ is the episode length, $K$ is the total number of episodes, and $\zeta$ denotes the local bound on the misspecification error. Furthermore, we propose an algorithmic framework that achieves the same order of regret bound without prior knowledge of $\zeta$, thereby enhancing its practical applicability.
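To make the distinction between worst-case and locally bounded misspecification concrete, the following is a minimal toy sketch, not the paper's formal definition or algorithm. It assumes a hypothetical finite state space, a made-up per-state approximation error, and a made-up policy-induced state distribution; the helper names `misspecification_summary` and `regret_rate` are illustrative only. The rate function drops the $\text{poly}(dH)$ and logarithmic factors and keeps just the $\sqrt{K} + K\zeta$ shape stated above.

```python
import numpy as np


def misspecification_summary(errors, state_dist):
    """Return (worst-case error, average error under state_dist).

    errors:     hypothetical per-state approximation errors.
    state_dist: hypothetical policy-induced state distribution (sums to 1).
    """
    errors = np.abs(np.asarray(errors, dtype=float))
    state_dist = np.asarray(state_dist, dtype=float)
    if errors.shape != state_dist.shape:
        raise ValueError("errors and state_dist must have the same shape")
    if not np.isclose(state_dist.sum(), 1.0):
        raise ValueError("state_dist must sum to 1")
    worst_case = float(errors.max())
    on_average = float(np.dot(state_dist, errors))
    return worst_case, on_average


def regret_rate(num_episodes, zeta):
    """Shape of the bound only: sqrt(K) + K * zeta (no poly(dH) or log factors)."""
    return float(np.sqrt(num_episodes) + num_episodes * zeta)


# Toy setup (all numbers are assumptions chosen for illustration):
# one rarely visited state has a constant-size error, the rest are small.
n_states = 100
errors = np.full(n_states, 0.01)
errors[0] = 1.0
state_dist = np.full(n_states, (1.0 - 0.001) / (n_states - 1))
state_dist[0] = 0.001

worst_case, on_average = misspecification_summary(errors, state_dist)

# Compare the rate when zeta is taken as the worst-case error
# versus the average error under the policy-induced distribution.
K = 10_000
rate_worst_case = regret_rate(K, worst_case)
rate_local = regret_rate(K, on_average)
print(worst_case, on_average, rate_worst_case, rate_local)
```

In this toy setup the worst-case error is constant while the distribution-weighted average stays small, so plugging the average into the $\sqrt{K} + K\zeta$ shape gives a much smaller linear term than plugging in the worst case. The linear term $K\zeta$ overtakes $\sqrt{K}$ once $K$ exceeds $1/\zeta^2$.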
