Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access

Daniel Ebi, Gaspard Lambrechts, Damien Ernst, Klemens Böhm

TL;DR

This work tackles learning under partial observability by letting the critic access privileged training-time signals without requiring full-state access. It introduces Informed Asymmetric Actor-Critic (IAAC), proves that policy gradients remain unbiased under arbitrary privileged inputs, and formalizes two informativeness criteria for selecting useful signals: an HSCIC-based criterion evaluated before training and a return-prediction-error criterion evaluated after training. Empirically, IAAC improves learning efficiency and value estimation on benchmark navigation tasks and synthetic informed POMDPs, sometimes surpassing full-state-augmented baselines. The results show that partial privileged information can improve training when it is sufficiently informative, which challenges the necessity of full-state access and guides the practical design of asymmetric RL methods.

Abstract

Reinforcement learning in partially observable environments requires agents to act under uncertainty from noisy, incomplete observations. Asymmetric actor-critic methods leverage privileged information during training to improve learning under these conditions. However, existing approaches typically assume full-state access during training. In this work, we challenge this assumption by proposing a novel actor-critic framework, called informed asymmetric actor-critic, that enables conditioning the critic on arbitrary privileged signals without requiring access to the full state. We show that policy gradients remain unbiased under this formulation, extending the theoretical foundation of asymmetric methods to the more general case of privileged partial information. To quantify the impact of such signals, we propose informativeness measures based on kernel methods and return prediction error, providing practical tools for evaluating training-time signals. We validate our approach empirically on benchmark navigation tasks and synthetic partially observable environments, showing that our informed asymmetric method improves learning efficiency and value estimation when informative privileged inputs are available. Our findings challenge the necessity of full-state access and open new directions for designing asymmetric reinforcement learning methods that are both practical and theoretically sound.
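The unbiasedness claim can be checked exactly on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical one-step informed POMDP with made-up probabilities and rewards, treats the single observation as the history, and computes the exact policy gradient twice, once with a critic conditioned on the history alone and once with a critic that also sees a noisy privileged signal that is not the full state. All names (`q_informed`, `q_history`, `policy_gradient`) are illustrative.

```python
import math

# Hypothetical one-step informed POMDP; every number here is an assumption.
STATES, OBS, SIGNALS, ACTIONS = [0, 1], [0, 1], [0, 1], [0, 1]
P_S = [0.5, 0.5]                          # prior over hidden states
P_O_GIVEN_S = [[0.7, 0.3], [0.4, 0.6]]    # noisy observation seen by the actor
P_I_GIVEN_S = [[0.9, 0.1], [0.2, 0.8]]    # privileged signal seen by the critic only
REWARD = [[1.0, 0.0], [0.0, 2.0]]         # REWARD[s][a]


def joint(s, o, i):
    """p(s, o, i), with o and i conditionally independent given s."""
    return P_S[s] * P_O_GIVEN_S[s][o] * P_I_GIVEN_S[s][i]


def policy(theta, o):
    """Softmax policy pi(a | o) with logits theta[o][a]."""
    z = [math.exp(theta[o][a]) for a in ACTIONS]
    total = sum(z)
    return [x / total for x in z]


def grad_log_pi(theta, o, a):
    """Gradient of log pi(a | o) with respect to the logits theta[o][:]."""
    pi = policy(theta, o)
    return [(1.0 if b == a else 0.0) - pi[b] for b in ACTIONS]


def q_informed(o, i, a):
    """Critic conditioned on the history and the privileged signal: E[r | o, i, a]."""
    num = sum(joint(s, o, i) * REWARD[s][a] for s in STATES)
    den = sum(joint(s, o, i) for s in STATES)
    return num / den


def q_history(o, a):
    """Critic conditioned on the history only: E[r | o, a]."""
    num = sum(joint(s, o, i) * REWARD[s][a] for s in STATES for i in SIGNALS)
    den = sum(joint(s, o, i) for s in STATES for i in SIGNALS)
    return num / den


def policy_gradient(theta, informed):
    """Exact expected policy gradient, summing over (o, i, a)."""
    grad = [[0.0 for _ in ACTIONS] for _ in OBS]
    for o in OBS:
        pi = policy(theta, o)
        for i in SIGNALS:
            p_oi = sum(joint(s, o, i) for s in STATES)
            for a in ACTIONS:
                q = q_informed(o, i, a) if informed else q_history(o, a)
                g = grad_log_pi(theta, o, a)
                for b in ACTIONS:
                    grad[o][b] += p_oi * pi[a] * g[b] * q
    return grad


theta = [[0.3, -0.2], [0.1, 0.5]]
g_informed = policy_gradient(theta, informed=True)
g_history = policy_gradient(theta, informed=False)
max_gap = max(abs(g_informed[o][b] - g_history[o][b]) for o in OBS for b in ACTIONS)
print("largest gradient difference:", max_gap)
```

In this toy setting the two gradients agree up to floating-point error, because the actor's action depends only on the observation, so averaging the informed critic over the signal recovers the history-based critic. The example says nothing about variance or sample efficiency, which is where the paper reports the practical benefit.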

Paper Structure

This paper contains 35 sections, 7 theorems, 34 equations, 2 figures, 2 tables.

Key Result

Lemma 4.1

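In an informed POMDP, the informed history-based reward function $R(h, i, a)$ satisfies $\mathbb{E}_{i \sim p(\cdot \mid h)}\left[ R(h, i, a) \right] = R(h, a)$ for all $h \in \mathcal{H}$ and $a \in \mathcal{A}$, where $R(h, a)$ denotes the history-based reward function and the expectation is taken under the belief $p(i \mid h)$.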
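A small numerical check can make the lemma concrete. The sketch below reads the lemma as saying that averaging $R(h, i, a)$ over the belief $p(i \mid h)$ recovers the history-based reward, which is our interpretation of the "unbiasedness" in its title. It fixes one history and one action, and assumes a hypothetical two-state, two-signal model in which the signal depends on the history only through the hidden state; all probabilities and rewards are invented for illustration.

```python
# Hypothetical discrete example for one fixed history h and one fixed action a.
# All probabilities and rewards below are illustrative assumptions.
belief_s = {"left": 0.25, "right": 0.75}              # p(s | h)
p_i_given_s = {
    "left": {"cold": 0.8, "warm": 0.2},
    "right": {"cold": 0.3, "warm": 0.7},
}                                                      # p(i | s)
reward = {"left": 4.0, "right": 1.0}                   # r(s, a) for the fixed a
signals = ["cold", "warm"]

# Belief over the privileged signal: p(i | h) = sum_s p(s | h) p(i | s).
p_i = {i: sum(belief_s[s] * p_i_given_s[s][i] for s in belief_s) for i in signals}

# Informed reward, taken here to be R(h, i, a) = E[r | h, i, a].
informed_reward = {
    i: sum(belief_s[s] * p_i_given_s[s][i] * reward[s] for s in belief_s) / p_i[i]
    for i in signals
}

# History-based reward, taken here to be R(h, a) = E[r | h, a].
history_reward = sum(belief_s[s] * reward[s] for s in belief_s)

# Average the informed reward over the belief p(i | h).
averaged = sum(p_i[i] * informed_reward[i] for i in signals)

print("R(h, i, a) per signal:", informed_reward)
print("belief-averaged:", averaged, "history-based:", history_reward)
```

The per-signal values differ from each other, since the signal carries information about the hidden state, but their belief-weighted average matches the history-based reward up to rounding.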

Figures (2)

  • Figure 1: Learning performance on six benchmark navigation tasks. Curves show episodic returns averaged over the last 100 episodes, with means and standard deviations computed across 20 independent runs.
  • Figure 2: Boxplot distributions of $\epsilon$ over 1,000 test episodes for synthetic POMDP instances with privileged signal $i_t$ ($\varsigma = 0.1$), computed for (a) different $\delta$ across 20 instances; (b) fixed $\delta = 0.05$ for five randomly sampled instances.

Theorems & Definitions (15)

  • Definition 4.1: Informed history-based reward function
  • Lemma 4.1: Unbiasedness of the informed history-based reward
  • proof
  • Definition 4.2: Informed history $Q$-function
  • Lemma 4.2: Unbiasedness of the informed $Q$-function
  • proof
  • Definition 4.3: Informed history value function
  • Lemma 4.3: Unbiasedness of the informed value function
  • proof
  • Theorem 4.1: Informed asymmetric policy gradient
  • ...and 5 more