Regret Guarantees for Linear Contextual Stochastic Shortest Path

Dor Polikar, Alon Cohen

TL;DR

The paper studies linear CSSP, where adversarially chosen contexts map linearly to SSP instances and the learner must reach a fixed goal state with minimal cumulative loss without knowing the MDP dynamics. It proposes LR-CSSP, an optimism-based algorithm that maintains confidence sets over the linear context embeddings of both losses and transitions, and switches to optimistic policies within interval-based learning to ensure termination. The main results are a regret bound of $\tilde{O}(K^{2/3} d^{2/3} |S| |A|^{1/3} B_\star^2 T_\star \log(1/\delta))$ in general, and a tighter bound of $\tilde{O}(\sqrt{K d^2 |S|^3 |A| B_\star^3 \log(1/\delta)/\ell_{\min}})$ when all costs are bounded below by $\ell_{\min}$. Both hold with high probability, and the prior knowledge they assume can be relaxed via standard tricks. The work extends the SSP and CMDP literature to continuous contexts with linear structure, provides a polynomial-time method based on convex updates and Extended Value Iteration, and guarantees termination even under partial information. The result is a sample-efficient approach to contextual RL for stochastic path-planning problems.
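
As a rough illustration of how the two bounds scale, the sketch below evaluates their leading terms while dropping constants and logarithmic factors (including the $\log(1/\delta)$ term). The function names and the sample parameter values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def general_bound(K, d, S, A, B_star, T_star):
    """Leading term of the K^(2/3) bound (constants and log factors dropped)."""
    return K ** (2 / 3) * d ** (2 / 3) * S * A ** (1 / 3) * B_star ** 2 * T_star


def min_cost_bound(K, d, S, A, B_star, ell_min):
    """Leading term of the sqrt(K) bound, valid when all costs are at least ell_min."""
    return math.sqrt(K * d ** 2 * S ** 3 * A * B_star ** 3 / ell_min)


# Hypothetical problem sizes, used only to show the growth in K.
params = dict(d=5, S=10, A=4, B_star=3.0)
for K in (10 ** 3, 10 ** 5, 10 ** 7):
    g = general_bound(K, T_star=6.0, **params)
    m = min_cost_bound(K, ell_min=0.1, **params)
    print(f"K={K:>8}: general ~ {g:.3e}, min-cost ~ {m:.3e}")
```

Because constants are dropped, only the growth rates are meaningful here: multiplying $K$ by 8 scales the first expression by 4, while multiplying $K$ by 4 scales the second by 2.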

Abstract

We define the problem of linear Contextual Stochastic Shortest Path (CSSP), where at the beginning of each episode, the learner observes an adversarially chosen context that determines the MDP through a fixed but unknown linear function. The learner's objective is to reach a designated goal state with minimal expected cumulative loss, despite having no prior knowledge of the transition dynamics, loss functions, or the mapping from context to MDP. In this work, we propose LR-CSSP, an algorithm that achieves a regret bound of $\widetilde{O}(K^{2/3} d^{2/3} |S| |A|^{1/3} B_\star^2 T_\star \log(1/\delta))$, where $K$ is the number of episodes, $d$ is the context dimension, $S$ and $A$ are the sets of states and actions respectively, $B_\star$ bounds the optimal cumulative loss and $T_\star$, unknown to the learner, bounds the expected time for the optimal policy to reach the goal. In the case where all costs exceed $\ell_{\min}$, LR-CSSP attains a regret of $\widetilde{O}(\sqrt{K \cdot d^2 |S|^3 |A| B_\star^3 \log(1/\delta)/\ell_{\min}})$. Unlike in contextual finite-horizon MDPs, where limited knowledge primarily leads to higher losses and regret, in the CSSP setting, insufficient knowledge can also prolong episodes and may even lead to non-terminating episodes. Our analysis reveals that LR-CSSP effectively handles continuous context spaces, while ensuring all episodes terminate within a reasonable number of time steps.
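
To make the setting concrete, here is a minimal sketch of one way a context could determine an SSP instance linearly. It assumes, purely for illustration, that the context is a probability vector over $d$ base MDPs sharing the same states, actions and goal, so that losses and transitions are the corresponding mixtures; the paper's exact parameterization may differ. All names (`mix_mdp`, `cost_to_go`, the toy arrays) are hypothetical.

```python
import numpy as np


def mix_mdp(context, base_losses, base_transitions):
    """Map a context (d,) to losses (S, A) and transitions (S, A, S + 1).

    Assumption: the context is a probability vector, so each mixed
    transition row is still a distribution. The last index of the
    transition array is the goal state.
    """
    losses = np.tensordot(context, base_losses, axes=1)
    transitions = np.tensordot(context, base_transitions, axes=1)
    return losses, transitions


def cost_to_go(losses, transitions, policy):
    """Expected cumulative loss of a deterministic proper policy.

    Solves (I - P_pi) V = loss_pi over the non-goal states, which has a
    unique solution when the policy reaches the goal with probability 1.
    """
    n_states = losses.shape[0]
    idx = np.arange(n_states)
    loss_pi = losses[idx, policy]
    p_pi = transitions[idx, policy, :n_states]  # drop the goal column
    return np.linalg.solve(np.eye(n_states) - p_pi, loss_pi)


# Toy instance: d = 2 base MDPs, 2 non-goal states, 1 action, unit losses.
base_losses = np.ones((2, 2, 1))
base_transitions = np.zeros((2, 2, 1, 3))
base_transitions[0, :, 0, 2] = 1.0  # base MDP 0: every state jumps to the goal
base_transitions[1, 0, 0, 1] = 1.0  # base MDP 1: state 0 moves to state 1
base_transitions[1, 1, 0, 2] = 1.0  # base MDP 1: state 1 jumps to the goal

context = np.array([0.5, 0.5])
losses, transitions = mix_mdp(context, base_losses, base_transitions)
values = cost_to_go(losses, transitions, policy=np.array([0, 0]))
print(values)
```

In this toy instance the learner would only see `context` and the trajectories it generates; the base arrays play the role of the unknown linear map.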

Paper Structure

This paper contains 23 sections, 25 theorems, 66 equations, 1 algorithm.

Key Result

Lemma 1

Suppose that there exists at least one proper policy and that for every improper policy $\pi'$ there exists at least one state $s \in S$ such that $\mathcal{V}^{\pi'}(s) = \infty$. Let $\pi$ be any policy, then

Theorems & Definitions (44)

  • Definition 1: Proper and Improper Policies
  • Lemma 1 (Bertsekas and Tsitsiklis, 1991)
  • Lemma 2 (Bertsekas and Tsitsiklis, 1991)
  • Theorem 2.1
  • Theorem 4.1
  • Corollary 4.1.1
  • Lemma 3
  • Lemma 4
  • proof
  • Lemma 5
  • ...and 34 more