Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Lingxiao Wang, Qi Cai, Zhuoran Yang, Zhaoran Wang

TL;DR

Embed to Control (ETC) tackles sample inefficiency in partially observable MDPs by learning two complementary representations: a low-dimensional per-step state feature that factorizes the transition and a multi-step history embedding built from these features. The core mechanism combines a forward emission operator with a Bellman operator to decompose trajectory density into stepwise components, enabling a tractable, sample-efficient planning process under a low-rank transition assumption. The authors prove an $O(1/\epsilon^2)$ sample complexity for achieving an $\epsilon$-suboptimal policy, with polynomial dependence on horizon $H$ and intrinsic rank $d$, and provide a flexible data-collection and density-estimation framework that supports multiple estimators. This work thus delivers the first theoretical bridge between representation learning and policy optimization in POMDPs with infinite observation and state spaces, with practical implications for sample-efficient control under partial observability.

Abstract

Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
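
The low-rank structure mentioned in the abstract can be illustrated in a finite toy setting. The sketch below assumes a factorization of the form $P(s' \mid s, a) = \psi(s')^\top \phi(s, a)$ with $d = 2$ latent factors and one fixed action. The state count, the numbers in `phi` and `psi`, and the helper `low_rank_transition` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Toy, finite-state illustration of a rank-d transition kernel.
# Assumption: P(s' | s, a) = sum_q phi(s, a)[q] * psi(s')[q] for one fixed action.
# All sizes and numbers are made up for illustration.

def low_rank_transition(phi, psi):
    """Return P[s][s_next] = sum_q phi[s][q] * psi[q][s_next]."""
    d = len(psi)
    n_next = len(psi[0])
    return [
        [sum(phi_s[q] * psi[q][s_next] for q in range(d)) for s_next in range(n_next)]
        for phi_s in phi
    ]

# phi[s]: distribution over d = 2 latent factors, given state s (action fixed).
phi = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75], [0.0, 1.0]]
# psi[q]: distribution over 4 next states, given latent factor q.
psi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

P = low_rank_transition(phi, psi)

# Every row of P is a probability distribution over next states.
row_sums = [sum(row) for row in P]

print(P[1])  # [0.25, 0.25, 0.25, 0.25]
```

Every row of the 4-by-4 matrix `P` is a convex combination of the two rows of `psi`, so its rank is at most 2 no matter how many states there are. This is the sense in which the rank, rather than the size of the state space, acts as the intrinsic dimension.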

Paper Structure

This paper contains 36 sections, 19 theorems, 169 equations, 1 figure, 2 tables, 2 algorithms.

Key Result

Lemma 3.6

We define the linear operator $\mathbb{U}^{\theta, \dagger}_{h}: L^1(\mathcal{A}^{k}\times\mathcal{O}^{k+1}) \mapsto L^1(\mathcal{S})$ for all $\theta\in\Theta$ and $h\in[H]$ as follows:

$$(\mathbb{U}^{\theta, \dagger}_{h} f)(s_h) = \int_{\mathcal{A}^{k}\times\mathcal{O}^{k+1}} \psi^{\theta}_{h-1}(s_h)^\top M^{\theta, \dagger}_{h}\, g^{\theta}_{h}(\tau^{h+k}_{h}) \cdot f(\tau^{h+k}_{h}) \, \mathrm{d}\tau^{h+k}_{h}.$$

Here $\mathbb{P}^{\theta, \pi}_h \in L^1(\mathcal{S})$ maps each state $s_h\in\mathcal{S}$ to the probability ...
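
As the lemma's title indicates, this operator serves as a pseudo-inverse of the forward emission operator. A finite-dimensional analogue may help build intuition. The sketch below swaps the feature-based construction above for a plain matrix left pseudo-inverse $(U^\top U)^{-1} U^\top$ of a toy emission matrix `U`, which is a simplification and not the paper's operator. The matrix entries, the state distribution `p`, and all helper names are hypothetical.

```python
# Finite analogue: recover a state distribution from an observation
# distribution using a left pseudo-inverse of the emission matrix.
# Assumption: U has full column rank (here 3 observations, 2 states).

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular; emission is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

def left_pseudo_inverse(U):
    """Return (U^T U)^{-1} U^T for a matrix U with exactly two columns."""
    Ut = transpose(U)
    return matmul(inverse_2x2(matmul(Ut, U)), Ut)

# U[o][s]: probability of observation o given state s (columns sum to 1).
U = [[0.5, 0.0],
     [0.5, 0.5],
     [0.0, 0.5]]

p = [0.25, 0.75]                 # hidden state distribution
obs = matvec(U, p)               # induced observation distribution
recovered = matvec(left_pseudo_inverse(U), obs)
```

Up to floating-point error, `recovered` equals `p`: applying the pseudo-inverse after the emission map returns the hidden state distribution, which is only possible because the two columns of `U` are linearly independent.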

Figures (1)

  • Figure 1: The directed acyclic graph (DAG) of a POMDP with low-rank transition. Here $\{s_h, s_{h+1}\}$, $\{o_h, o_{h+1}\}$, $a_h$, $r_h$ are the states, observations, action, and reward, respectively. In addition, we denote by $q_h$ the bottleneck factor induced by the low-rank transition, which depends on the state and action pair $(s_h, a_h)$ and determines the density of the next state $s_{h+1}$. In the DAG, we represent observable and unobservable variables by the shaded and unshaded nodes, respectively. In addition, we use the dashed node and arrows for the latent factor $q_h$ and its corresponding transitions, respectively, to differentiate this bottleneck factor from the state of the POMDP.
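
The generative process described in the Figure 1 caption can be sketched as a sampler for a finite toy POMDP. The sketch assumes that the observation $o_h$ is drawn from the current state $s_h$, that the bottleneck factor $q_h$ is drawn from $(s_h, a_h)$, and that the next state $s_{h+1}$ is drawn from $q_h$ alone. The reward is omitted. All tables, sizes, the memoryless `policy`, and the function names are hypothetical.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def rollout(phi, psi, emission, policy, s0, horizon, rng):
    """Sample one trajectory; only (observation, action) pairs are returned."""
    s = s0
    trajectory = []
    for _ in range(horizon):
        o = sample_categorical(emission[s], rng)   # s_h -> o_h (observed)
        a = policy(o, rng)                         # agent acts on what it sees
        q = sample_categorical(phi[s][a], rng)     # (s_h, a_h) -> q_h (latent)
        s = sample_categorical(psi[q], rng)        # q_h -> s_{h+1} (latent)
        trajectory.append((o, a))
    return trajectory

# Toy model: 3 states, 2 actions, 2 bottleneck factors, 3 observations.
phi = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.5, 0.5], [0.2, 0.8]],
    [[0.3, 0.7], [0.6, 0.4]],
]
psi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
emission = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

def policy(observation, rng):
    # Hypothetical uniform-random policy that ignores the observation.
    return rng.randrange(2)

rng = random.Random(0)
traj = rollout(phi, psi, emission, policy, s0=0, horizon=5, rng=rng)
```

The states and bottleneck factors never leave `rollout`, mirroring the unshaded and dashed nodes in the DAG: the learner only has access to the observation and action sequence in `traj`.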

Theorems & Definitions (26)

  • Definition 3.2: Function Approximation
  • Definition 3.4: Forward Emission Operator
  • Lemma 3.6: Pseudo-Inverse of Forward Emission
  • Definition 3.7: Bellman Operator
  • Lemma 3.8: Embedding Decomposition
  • Theorem 5.3
  • Lemma B.1: Performance Difference
  • Lemma B.2: Norm Bound of Bellman Operator
  • Definition B.3: Reverse Emission
  • Lemma B.4: Good Event Probability
  • ...and 16 more