Optimism in Reinforcement Learning with Generalized Linear Function Approximation

Yining Wang, Ruosong Wang, Simon S. Du, Akshay Krishnamurthy

TL;DR

The paper addresses exploration in episodic reinforcement learning with large state spaces by using generalized linear models (GLMs) to approximate the optimal Q-function. Instead of strong assumptions on the dynamics, it relies on an expressivity condition called optimistic closure, and it develops the LSVI-UCB algorithm, which maintains optimistic Q-value estimates through GLM-based Bellman backups. The main contribution is a regret bound of $\tilde{O}(H\sqrt{d^3T})$, which the authors present as the first statistically and computationally efficient RL method with GLM function approximation, and which connects to both the tabular and linear-MDP regimes. The work also clarifies how optimistic closure relates to existing closure notions, and suggests future directions toward broader function classes and weaker assumptions for RL with function approximation.

Abstract

We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, we prove that our algorithm enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where $d$ is the dimensionality of the state-action features and $T$ is the number of episodes. This is the first statistically and computationally efficient algorithm for reinforcement learning with generalized linear functions.
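As a rough illustration of what an optimistic GLM-based Q-value estimate can look like, the sketch below adds an elliptical confidence bonus to a GLM prediction and truncates at the horizon. The function name `optimistic_q`, the sigmoid link, the bonus scale `gamma`, and the toy numbers are all assumptions made for illustration; this is not a reproduction of the paper's algorithm.

```python
import numpy as np

def sigmoid(z):
    """Example link function f; the choice of sigmoid is an assumption."""
    return 1.0 / (1.0 + np.exp(-z))

def optimistic_q(phi, theta_hat, Lambda, gamma, H, link=sigmoid):
    """Optimistic estimate min(H, f(<phi, theta_hat>) + gamma * ||phi||_{Lambda^{-1}}).

    phi       : feature vector of one state-action pair, shape (d,)
    theta_hat : current GLM parameter estimate, shape (d,)
    Lambda    : regularized feature covariance matrix, shape (d, d)
    gamma     : bonus scale (hypothetical tuning parameter)
    H         : horizon, used as an upper bound on any Q-value
    """
    mean = link(phi @ theta_hat)
    # Elliptical bonus: large in directions the data has rarely covered.
    bonus = gamma * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return min(H, mean + bonus)

d = 3
phi = np.array([1.0, 0.0, 0.0])
theta_hat = np.zeros(d)

# With no data, Lambda is the identity and the bonus is at its largest.
Lambda = np.eye(d)
q_before = optimistic_q(phi, theta_hat, Lambda, gamma=0.5, H=2.0)

# After three observations of phi, the bonus in that direction shrinks.
Lambda_more = Lambda + 3.0 * np.outer(phi, phi)
q_after = optimistic_q(phi, theta_hat, Lambda_more, gamma=0.5, H=2.0)

# A very large bonus is clipped at the horizon H.
q_clipped = optimistic_q(phi, theta_hat, Lambda, gamma=10.0, H=2.0)
```

The bonus shrinks as the same feature direction is observed more often, which is what drives exploration toward poorly covered state-action pairs.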

Paper Structure

This paper contains 19 sections, 18 theorems, 56 equations.

Key Result

Proposition 1

If an MDP is linear, then the optimistic closure assumption holds with $\mathcal{G} = \{ (s,a) \mapsto \langle w,\psi(s,a)\rangle: w \in \mathbb{B}_d\}$.
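The intuition behind this proposition can be sketched with a short derivation, assuming the standard linear-MDP definition in which rewards and transitions are both linear in $\psi$. The symbols $\theta_h$, $\mu_h$, and $\mathcal{T}_h$ (the Bellman backup at step $h$) do not appear in the text above and are introduced here as assumptions: $r_h(s,a) = \langle \psi(s,a), \theta_h \rangle$ and $P_h(s' \mid s,a) = \langle \psi(s,a), \mu_h(s') \rangle$. For any function $g$ of state-action pairs:

$$
\begin{aligned}
(\mathcal{T}_h g)(s,a) &= r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\Big[\max_{a'} g(s',a')\Big] \\
&= \langle \psi(s,a), \theta_h \rangle + \int \max_{a'} g(s',a') \, \langle \psi(s,a), \mathrm{d}\mu_h(s') \rangle \\
&= \Big\langle \psi(s,a),\ \theta_h + \int \max_{a'} g(s',a')\, \mathrm{d}\mu_h(s') \Big\rangle .
\end{aligned}
$$

The backup is therefore linear in $\psi(s,a)$ whatever $g$ is, including optimistic estimates. Whether the resulting weight vector lies in $\mathbb{B}_d$ depends on how the features and parameters are normalized, which this sketch does not address.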

Theorems & Definitions (20)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Corollary 2
  • Lemma 1
  • Corollary 3
  • Lemma 2
  • Corollary 4
  • ...and 10 more