Provable Partially Observable Reinforcement Learning with Privileged Information

Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang

TL;DR

This work addresses the challenge of partial observability in reinforcement learning by systematically studying the use of privileged information during training. It analyzes two practical paradigms, expert policy distillation and asymmetric actor-critic (AAC), and identifies conditions under which they become provably efficient, notably the deterministic filter condition and belief-learning mechanisms. The authors develop theoretically grounded algorithms, including belief-weighted AAC and decoding-function learners, that achieve polynomial sample complexity and (quasi-)polynomial computational time, and extend these ideas to partially observable multi-agent settings under centralized training with decentralized execution (CTDE). The results clarify when privileged information yields gains in sample efficiency and computation, and offer a foundation for principled RL in structured POMDPs and multi-agent RL (MARL) with privileged data.

Abstract

Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the deterministic filter condition, under which expert distillation achieves sample and computational complexities that are both polynomial. Furthermore, we investigate another useful empirical paradigm of asymmetric actor-critic, and focus on the more challenging setting of observable partially observable Markov decision processes (POMDPs). We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring centralized-training-with-decentralized-execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.
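For readers less familiar with belief states, the following is a minimal sketch of the standard Bayes filter for a tabular POMDP with a known model. It is not the paper's belief-learning oracle, which must work from samples under a misspecified model. The array layout (`T[a, s, s']`, `O[s', o]`), the function name `belief_update`, and the numbers are assumptions made for illustration.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step for a tabular POMDP (illustrative layout).

    belief: shape (S,), current distribution over states
    T:      shape (A, S, S), T[a, s, s2] = P(s2 | s, a)   (assumed layout)
    O:      shape (S, num_obs), O[s2, o] = P(o | s2)      (assumed layout)
    """
    # Prediction step: push the belief through the transition kernel.
    predicted = belief @ T[action]
    # Correction step: reweight by the likelihood of the observation.
    unnormalized = predicted * O[:, observation]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total

# Hypothetical two-state, one-action, two-observation model.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)
```

Here `b1` is the posterior over states after taking action 0 and seeing observation 1. When the filter is run with an inaccurate `T` or `O`, errors in the belief can accumulate over time, which is why stability of the filter under model misspecification matters.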

Paper Structure

This paper contains 63 sections, 41 theorems, 60 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Proposition 3.1

For any $\epsilon, \gamma\in(0, 1)$, there exists a $\gamma$-observable POMDP $\mathcal{P}^\epsilon$ with $H=1$, $S=O=A=2$ such that for any behavior policy $\pi^\prime\in\Pi^{\text{gen}}$ and choice of $D_f$ in eq:kl, it holds that $v^{\mathcal{P}^\epsilon}(\widehat{\pi}^\star)\le \max_{\pi\in\Pi}
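To give some intuition for why distilling a state-based expert can be suboptimal, here is a toy one-step example with $S=O=A=2$. It is an assumption-laden illustration and not the construction used in Proposition 3.1: it fixes a uniform prior, a symmetric emission matrix with accuracy 0.8, a reward of 1 when the action equals the state, and forward-KL distillation only, whose minimizer is the posterior-averaged expert policy.

```python
import numpy as np

# Hypothetical one-step POMDP: 2 states, 2 observations, 2 actions.
prior = np.array([0.5, 0.5])            # P(s)
emission = np.array([[0.8, 0.2],        # emission[s, o] = P(o | s)
                     [0.2, 0.8]])
expert = np.eye(2)                      # expert[s, a]: expert plays a = s

joint = prior[:, None] * emission                       # joint[s, o] = P(s, o)
posterior = joint / joint.sum(axis=0, keepdims=True)    # posterior[s, o] = P(s | o)

# Forward-KL distillation onto observation-based policies:
# the minimizer is student[o, a] = sum_s P(s | o) * expert[s, a].
student = posterior.T @ expert

def value(policy):
    """Expected reward of policy[o, a] when reward is 1 iff a == s."""
    return float(np.einsum("so,oa,sa->", joint, policy, np.eye(2)))

# Best observation-based policy: act greedily on the posterior.
greedy = np.eye(2)[posterior.argmax(axis=0)]            # greedy[o, a]

distilled_value = value(student)   # 0.8**2 + 0.2**2 = 0.68
best_value = value(greedy)         # 0.8
```

In this toy instance the distilled student randomizes in proportion to the posterior and earns 0.68, while the best observation-based policy earns 0.8. The student copies the expert's average behavior rather than optimizing under its own information.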

Figures (3)

  • Figure 1: A landscape of POMDP models that partially observable RL with privileged information addresses, with both statistical and computational complexity considerations. The $x$ and $y$ axes denote the "restrictiveness" of the assumptions, on the emission channels/observations and transition dynamics, respectively.
  • Figure 2: Results for POMDPs of different moderate sizes, where our methods achieve the best performance with the lowest sample complexity (VI: value iteration; AAC: asymmetric actor-critic).
  • Figure 3: Results for POMDPs of larger sizes, where our methods achieve the best performance with the lowest sample complexity (VI: value iteration; AAC: asymmetric actor-critic).

Theorems & Definitions (66)

  • Definition 2.1: $\epsilon$-optimal policy
  • Definition 2.2: $\epsilon$-approximate Nash equilibrium with information sharing
  • Definition 2.3: $\epsilon$-approximate coarse correlated equilibrium with information sharing
  • Definition 2.4: $\epsilon$-approximate correlated equilibrium with information sharing
  • Proposition 3.1: Pitfall of expert policy distillation
  • Definition 3.2: Deterministic filter condition
  • Example 3.3: Deterministic POMDP [jin2020sample, uehara2023computationally]
  • Example 3.4: Block MDP [krishnamurthy2016pac, du2019provably]
  • Example 3.5: $k$-step decodable POMDP [EfroniJKM22]
  • Example 3.6: POMDP with arbitrary, unknown decodable length
  • ...and 56 more