Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation

Jaehyun Park, Junyeop Kwon, Dabeen Lee

TL;DR

This work develops a provably efficient reinforcement learning algorithm, UCMNLK, for infinite-horizon MDPs whose transition function follows a multinomial logistic (MNL) model. The approach builds confidence polytopes over transition probabilities and uses discounted extended value iteration (DEVI) to obtain optimistic policies, achieving regret guarantees in both the average-reward and discounted-reward settings. The paper establishes regret upper bounds of $\tilde{\mathcal{O}}(dD\sqrt{T})$ and $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ for the respective settings, with corresponding lower bounds of $\Omega(d\sqrt{DT})$ and $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$, as well as a finite-horizon lower bound of $\Omega(dH^{3/2}\sqrt{K})$ that improves on the best-known prior result. A key technical contribution is the confidence-polytope construction, which enables tractable optimization over transition probabilities despite non-convexity in the logistic parameters, together with a reduction-based technique that relates MNL transitions to linear-mixture MDPs for deriving lower bounds. The upper and lower bounds match in their dependence on $d$ and $T$, giving theory-backed guarantees for RL with MNL function approximation, with potential relevance to large-scale, structured RL problems where multinomial transitions are natural.
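To make the MNL transition model concrete, here is a minimal sketch in Python of how next-state probabilities can be computed as a softmax over linear scores. The function name `mnl_transition_probs`, the feature-matrix layout (one row per candidate next state), and the example numbers are illustrative assumptions, not the paper's notation or implementation.

```python
import numpy as np


def mnl_transition_probs(features, theta):
    """Softmax over linear scores, one score per candidate next state.

    features: array of shape (num_next_states, d), a hypothetical layout in
              which row i holds the feature vector for next state i at a
              fixed (state, action) pair.
    theta:    array of shape (d,), the transition parameter.
    """
    logits = features @ theta
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Illustrative numbers only: 4 candidate next states, feature dimension d = 3.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
theta = np.array([0.5, -1.0, 0.25])

probs = mnl_transition_probs(features, theta)
print(probs, probs.sum())
```

With `theta` set to zero, every next state gets the same score, so the sketch returns the uniform distribution over the candidate next states.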

Abstract

We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret, where $\gamma$ is the discount factor. We then complement these upper bounds with several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
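As a quick worked comparison for the average-reward setting, dividing the stated upper bound by the stated lower bound (ignoring the logarithmic factors hidden by $\tilde{\mathcal{O}}$) shows which parameters the two bounds agree on:

$$
\frac{dD\sqrt{T}}{d\sqrt{DT}} \;=\; \frac{D}{\sqrt{D}} \;=\; \sqrt{D}.
$$

Under this reading, the average-reward bounds agree in their dependence on $d$ and $T$ and differ by a factor of $\sqrt{D}$ in the diameter.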

Paper Structure (49 sections, 41 theorems, 253 equations, 2 figures, 1 table, 2 algorithms)

Key Result

Lemma 3.1

Suppose that the assumptions labeled ass:L bound through ass:recenter hold. Let $\delta \in (0,1)$, $\eta = (1/2)\log\mathcal{U}+(L_\theta L_\varphi +1)$, and $\lambda \geq 84\sqrt{2}(L_\theta L_\varphi^3 + dL_\varphi^2)\eta$. Then, with probability at least $1-\delta$, $\theta^*$ is contained in the confidence set with $\beta_{t} = f(L_\theta,L_\varphi)\sqrt{d}(\log(\mathcal{U}t/\delta))^2$ for every $t\in [T]$, where $f$ is a polynomial in $L_\theta$ and $L_\varphi$.
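The constants in the lemma can be evaluated directly from their stated formulas. The following Python sketch does that arithmetic; the function names, the example inputs, and in particular `f_value` (a stand-in for the unspecified polynomial $f(L_\theta, L_\varphi)$) are hypothetical, and $\mathcal{U}$ is treated as a given positive number since this excerpt does not define it.

```python
import math


def lemma_constants(L_theta, L_phi, d, U):
    """Evaluate eta and the smallest admissible lambda from the lemma.

    eta        = (1/2) * log(U) + (L_theta * L_phi + 1)
    lambda_min = 84 * sqrt(2) * (L_theta * L_phi**3 + d * L_phi**2) * eta
    """
    eta = 0.5 * math.log(U) + (L_theta * L_phi + 1.0)
    lambda_min = 84.0 * math.sqrt(2.0) * (L_theta * L_phi**3 + d * L_phi**2) * eta
    return eta, lambda_min


def beta(t, delta, d, U, f_value):
    """beta_t = f * sqrt(d) * (log(U * t / delta))**2.

    f_value is a placeholder: the lemma only says f is a polynomial,
    so its actual value is an assumption here.
    """
    return f_value * math.sqrt(d) * math.log(U * t / delta) ** 2


# Illustrative inputs only.
eta, lambda_min = lemma_constants(L_theta=1.0, L_phi=1.0, d=4, U=10.0)
beta_10 = beta(t=10, delta=0.05, d=4, U=10.0, f_value=1.0)
beta_100 = beta(t=100, delta=0.05, d=4, U=10.0, f_value=1.0)
print(eta, lambda_min, beta_10, beta_100)
```

For fixed inputs with $\mathcal{U}t/\delta > 1$, `beta` grows with `t` only polylogarithmically, which matches the $(\log(\mathcal{U}t/\delta))^2$ form in the statement.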

Figures (2)

  • Figure 1: Illustration of the Hard Finite-Horizon MDP Instance
  • Figure 2: Illustration of the Hard-to-Learn Infinite-Horizon MDP Instance

Theorems & Definitions (41)

  • Lemma 3.1
  • Lemma 3.2
  • Theorem 1: Average-Reward
  • Theorem 2: Discounted-Reward
  • Lemma 3.3
  • Lemma 3.4
  • Lemma 3.5
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • ...and 31 more