Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation

Jaehyun Park; Junyeop Kwon; Dabeen Lee

TL;DR

This work develops a provably efficient reinforcement learning algorithm, UCMNLK, for infinite-horizon MDPs whose transition function follows a multinomial logistic (MNL) model. The algorithm builds confidence polytopes over transition probabilities and runs discounted extended value iteration (DEVI) to obtain optimistic policies, with regret guarantees in both the average-reward and discounted-reward settings. The regret upper bounds are $\tilde{\mathcal{O}}(dD\sqrt{T})$ and $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ for the respective settings, complemented by lower bounds of $\Omega(d\sqrt{DT})$ and $\Omega(d(1-\gamma)^{3/2}\sqrt{T})$; in the average-reward setting the two differ by a factor of $\sqrt{D}$ up to logarithmic terms. A further lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for the finite-horizon episodic setting improves on prior results. Key technical contributions are the confidence-polytope construction, which enables tractable optimization over transition probabilities despite non-convexity in the logistic parameters, and a reduction that relates MNL transitions to linear-mixture MDPs for deriving lower bounds. Together the results give theory-backed guarantees for RL with MNL function approximation, which may be relevant to large-scale, structured RL problems where multinomial transitions are natural.

Abstract

We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$ where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret where $\gamma$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
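
To make the setting in the abstract concrete, the following is a minimal sketch of an MNL transition model combined with ordinary discounted value iteration. It is not the paper's UCMNLK algorithm: there is no confidence polytope and no optimism, and the parameter is treated as known. The function names, the random feature tensor `phi`, the parameter vector `theta`, the problem sizes, and the assumption that every state is reachable from every state-action pair are all hypothetical choices made for illustration.

```python
import numpy as np


def mnl_transition_probs(features, theta):
    """Softmax (MNL) distribution over candidate next states.

    features: (n_next, d) array; row i is the feature vector of next state i.
    theta:    (d,) parameter vector (assumed known in this sketch).
    """
    logits = features @ theta
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def discounted_value_iteration(P, r, gamma, n_iters=500):
    """Plain discounted value iteration on a known tabular model.

    P: (S, A, S) transition tensor, r: (S, A) rewards in [0, 1].
    Returns the value estimate and a greedy policy.
    """
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = r + gamma * (P @ V)  # (S, A) Bellman backup
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)


def build_mnl_mdp(n_states=4, n_actions=2, d=3, seed=0):
    """Random toy MDP whose transitions follow the MNL form (illustrative)."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=(n_states, n_actions, n_states, d))
    theta = rng.normal(size=d)
    P = np.empty((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = mnl_transition_probs(phi[s, a], theta)
    r = rng.uniform(size=(n_states, n_actions))
    return P, r


P, r = build_mnl_mdp()
V, policy = discounted_value_iteration(P, r, gamma=0.9)
```

In this sketch each row `P[s, a]` is a softmax over next-state features, so it is a valid probability distribution by construction, and with rewards in $[0,1]$ the value estimates stay below $1/(1-\gamma)$. An optimistic variant in the spirit of the paper would replace the fixed `P` with a maximization over a confidence set at each backup, which is omitted here.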

Paper Structure (49 sections, 41 theorems, 253 equations, 2 figures, 1 table, 2 algorithms)

Introduction
Our Contributions
Preliminaries
Notations
Infinite-Horizon Average-Reward MDP
Discounted-Reward MDP
Multinomial Logistic Model
Algorithm and Regret Bounds
Confidence Polytope for the True Transition Function
Algorithm Description of UCMNLK
Regret Analysis of UCMNLK
Regret Lower Bounds
Lower Bound for Learning Finite-Horizon Episodic MDPs
Lower Bounds for Learning Infinite-Horizon MDPs
Conclusion
...and 34 more sections

Key Result

Lemma 3.1

Suppose that Assumptions ass:L bound--ass:recenter hold. Let $\delta \in (0,1)$, $\eta = (1/2)\log\mathcal{U}+(L_\theta L_\varphi +1)$, and $\lambda \geq 84\sqrt{2}(L_\theta L_\varphi^3 + dL_\varphi^2)\eta$. With probability at least $1-\delta$, $\theta^*$ is contained in […], where $\beta_{t} = f(L_\theta,L_\varphi)\sqrt{d}(\log(\mathcal{U}t/\delta))^2$ for every $t\in [T]$ and $f$ is a polynomial in $L_\theta$ and $L_\varphi$.

Figures (2)

Figure 1: Illustration of the Hard Finite-Horizon MDP Instance
Figure 2: Illustration of the Hard-to-Learn Infinite-Horizon MDP Instance

Theorems & Definitions (41)

Lemma 3.1
Lemma 3.2
Theorem 1: Average-Reward
Theorem 2: Discounted-Reward
Lemma 3.3
Lemma 3.4
Lemma 3.5
Theorem 3
Theorem 4
Theorem 5
...and 31 more
