Sample Complexity Characterization for Linear Contextual MDPs

Junze Deng; Yuan Cheng; Shaofeng Zou; Yingbin Liang

TL;DR

This work studies contextual MDPs (CMDPs) in which both transitions and rewards vary with context and admit linear function approximation. It introduces two models: Model I, with context-varying representations and common weights, and Model II, with common representations and context-varying weights. Both are handled with model-based algorithms that use novel optimistic bonuses. The authors prove a guaranteed $\epsilon$-suboptimality gap with polynomial sample complexity; for Model I this removes the reachability assumption required by earlier tabular CMDP results, and for Model II it is the first known result. Comparing the two bounds suggests that context-varying features yield substantially better sample efficiency than common representations.

Abstract

Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with different MDPs indexed by a context variable. While CMDPs serve as an important framework for modeling many real-world applications with time-varying environments, they are largely unexplored from a theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I, with context-varying representations and common linear weights for all contexts; and Model II, with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they enjoy a guaranteed $\epsilon$-suboptimality gap with the desired polynomial sample complexity. In particular, instantiating our result for the first model to the tabular CMDP improves the existing result by removing the reachability assumption. Our result for the second model is the first known result for this type of function approximation model. Comparison between our results for the two models further indicates that having context-varying features leads to much better sample efficiency than having common representations for all contexts under linear CMDPs.
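To make the distinction between the two models concrete, here is a minimal sketch of how their transition kernels could be built. It assumes the standard linear MDP factorization $P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle$, which is not spelled out in the text above; the paper's formal definitions may differ. The sizes `S`, `A`, `d`, `W`, the helper `random_simplex`, and the choice to draw features and weights from probability simplices (a convenient sufficient condition for valid kernels) are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy sizes: states, actions, feature dimension, number of contexts.
S, A, d, W = 4, 2, 3, 5
rng = np.random.default_rng(0)


def random_simplex(shape, rng):
    """Random nonnegative array whose last axis sums to 1."""
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)


# Model I (assumed form): the representation phi_w(s, a) depends on the
# context w, while the weights mu(s') are shared by all contexts.
mu_shared = random_simplex((d, S), rng)       # d distributions over next states
phi_ctx = random_simplex((W, S, A, d), rng)   # one feature map per context
P_model1 = phi_ctx @ mu_shared                # shape (W, S, A, S)

# Model II (assumed form): the representation phi(s, a) is shared, while
# the weights mu_w(s') depend on the context w.
phi_shared = random_simplex((S, A, d), rng)   # single feature map
mu_ctx = random_simplex((W, d, S), rng)       # one weight matrix per context
P_model2 = np.einsum("sad,wdt->wsat", phi_shared, mu_ctx)  # shape (W, S, A, S)

# Each P[w, s, a, :] is a convex combination of distributions, so it is
# itself a distribution over next states.
print(P_model1.shape, P_model2.shape)
```

In both cases every context indexes its own MDP, but what changes across contexts is different: the feature map in Model I, and the linear weights in Model II.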

Paper Structure (20 sections, 28 theorems, 140 equations, 2 algorithms)

Introduction
Related Work
Preliminaries and Problem Formulation
Contextual MDPs
Two Linear Function Approximation Models for CMDPs
Model I: Varying Representation
Algorithm
Theoretical Analysis
Model II: Varying Linear Weights
Algorithm
Theoretical Analysis
Conclusion
Proof of Theorem 1
Supporting Lemmas
Proof of Theorem 1
...and 5 more sections

Key Result

Theorem 1

Consider a CMDP with varying representations as defined in Definition 1. Under ass:mu(s), for any $\delta\in(0,1)$, with probability at least $1-3\delta/2$, the sequence of policies $\{\pi_{w_n}^n\}_{n=1}^N$ generated by Algorithm 1 satisfies an upper bound on the average sub-optimality gap, where $\lambda = \min\{ \lambda_1,\xi_1 \}$. To achieve an $\epsilon$ average sub-optimality gap, at most $\mathcal{O}\left(\frac{H^4d^3\log(|\Psi_1|/\delta)}{

Theorems & Definitions (52)

Definition 1: Model I: CMDPs with varying representation
Definition 2: Model II: CMDPs with varying linear weights
Remark 1
Theorem 1
Remark 2
Theorem 2
Lemma 1
Lemma 2
Lemma 3
proof
...and 42 more
