To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning

Yuda Song, Dhruv Rohatgi, Aarti Singh, J. Andrew Bagnell

TL;DR

This work interrogates the use of privileged latent-state information to train policies for partially observable tasks. By introducing the delta-perturbed Block MDP and separating approximate decodability from belief contraction, the authors show a fundamental trade-off: distillation can rival RL when latent dynamics are deterministic, but standard RL with frame-stacking outperforms distillation as stochasticity grows. They further propose smoothing the latent expert to yield more robust distillation, and provide theoretical and empirical results illustrating when each approach excels. The findings offer practical guidelines for leveraging privileged information to improve policy learning in partially observable domains, bridging theoretical insights with locomotion benchmarks.

Abstract

Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used privileged expert distillation--which leverages availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy--to disentangle the task of "learning to see" from "learning to act". While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper--through a simple but instructive theoretical model called the perturbed Block MDP, and controlled experiments on challenging simulated locomotion tasks--we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. Our main findings are: (1) The trade-off empirically hinges on the stochasticity of the latent dynamics, as theoretically predicted by contrasting approximate decodability with belief contraction in the perturbed Block MDP; and (2) The optimal latent policy is not always the best latent policy to distill. Our results suggest new guidelines for effectively exploiting privileged information, potentially advancing the efficiency of policy learning across many practical partially observable domains.

Paper Structure

This paper contains 67 sections, 29 theorems, 129 equations, 9 figures, 4 tables.

Key Result

lemma 1

Let $\pi^{\mathsf{latent}} \in \Pi^{\mathsf{latent}}$ be a latent policy and let $\widetilde{\mathbf{b}}_{1:H}$ be a collection of functions $\widetilde{\mathbf{b}}_h: \mathcal{X}^h \times \mathcal{A}^{h-1} \to \Delta(\mathcal{S}_h)$. Define executable policies $\widetilde{\pi},\pi$ by $\widetilde{\pi}(x_{1:h},a_{1:h-1}) := \pi^{\mathsf{latent}} \circ \widetilde{\mathbf{b}}_h(x_{1:h},a_{1:h-1})$ and $\pi(x_{1:h},a_{1:h-1}) := \pi^{\mathsf{latent}} \circ \mathbf{b}_h(x_{1:h},a_{1:h-1})$ (see eq:belief-pi-co

Figures (9)

  • Figure 1: The performance of (offline/online) expert distillation and RL with respect to wall-clock time. We repeat each experiment 5 times and plot the mean and standard deviation. For the time complexity of BC, we include the data collection time, and amortize it over the training steps. For both BC and $\mathtt{DAgger}$, we include the time to train the latent expert (also amortized).
  • Figure 2: The normalized suboptimality of the expert distillation algorithms (top: behavior cloning; bottom: $\mathtt{DAgger}$) with respect to the horizon. We repeat 5 runs for each horizon and task, and perform linear regression on the results from each task. Note that the trajectory rewards for this plot have been normalized by horizon (and by action-prediction error), so linear scaling indicates compounding errors.
  • Figure 3: Performance of $\mathtt{DAgger}$ and RL with different frame-stacks on humanoid-walk and dog-walk with motor noise. We repeat each experiment 5 times and plot the mean and standard deviation. Note that in general, the improvement of RL over $\mathtt{DAgger}$ increases with the motor noise.
  • Figure 4: Belief contraction error with respect to the framestack length $L \in \{2,3,4,5\}$ on all tasks. For each $L$, we train a Gaussian-parametrized neural network to predict the belief from an input of $L$ stacked frames. We compute the KL divergence to the output of an $L=10$ network (serving as an approximation of the true belief), averaged over a validation dataset with 100 episodes of data. The orange plot denotes the decrease in KL divergence between two framestack lengths. We repeat each experiment 5 times and plot the mean and standard deviation. We observe that the belief contraction error decreases as the number of stacked frames increases, although not as fast as predicted by the theory.
  • Figure 5: Performance of $\mathtt{DAgger}$ on the validation dataset for the humanoid-walk and dog-walk environments with motor noise $\sigma=0.2$, as the noise level for the training environment (i.e. the environment in which the latent expert was trained) varies over $\{0.1,0.2,0.3,0.4,0.5\}$.
  • ...and 4 more figures
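
The belief-contraction measurement described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each belief network outputs a diagonal Gaussian (a mean and a standard deviation per latent dimension) and that the KL divergence is taken from the $L=10$ reference prediction to the short-framestack prediction; the caption does not state the direction. The function names `diag_gaussian_kl` and `mean_contraction_error` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A predicted belief: (means, standard deviations), one entry per latent dimension.
# Assumption: beliefs are diagonal Gaussians; the caption only says "Gaussian".
Gaussian = Tuple[Sequence[float], Sequence[float]]


def diag_gaussian_kl(p: Gaussian, q: Gaussian) -> float:
    """Closed-form KL(p || q) between two diagonal Gaussians."""
    mu_p, sigma_p = p
    mu_q, sigma_q = q
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        # Per-dimension KL between N(mp, sp^2) and N(mq, sq^2).
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def mean_contraction_error(reference: List[Gaussian], short: List[Gaussian]) -> float:
    """Average KL(reference || short) over validation samples.

    `reference` stands in for the L=10 network's outputs and `short` for the
    outputs of a network with a smaller framestack (hypothetical inputs).
    """
    if len(reference) != len(short) or not reference:
        raise ValueError("need two non-empty prediction lists of equal length")
    kls = [diag_gaussian_kl(r, s) for r, s in zip(reference, short)]
    return sum(kls) / len(kls)
```

Under these assumptions, repeating `mean_contraction_error` for each framestack length against the same reference predictions would give one point per $L$ on a curve like the one in Figure 4.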

Theorems & Definitions (66)

  • definition 1
  • definition 2: Decodability Error
  • lemma 1: See \ref{lemma:tv-beltil-to-latent}
  • definition 3: Belief Contraction Error [golowich2023planning]
  • theorem 1: Informal; see \ref{thm:golowich}; due to [golowich2022learning]
  • definition 4
  • theorem 2: See \ref{thm:belief-contraction}
  • corollary 1: Informal; see \ref{cor:golowich-perturbed-mdp}
  • proposition 1: See \ref{prop:vinf-decay}
  • theorem 3: See \ref{thm:perturbed-mdp-il-guarantee}
  • ...and 56 more