To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning
Yuda Song, Dhruv Rohatgi, Aarti Singh, J. Andrew Bagnell
TL;DR
This work examines when privileged latent-state information helps train policies for partially observable tasks. Using the δ-perturbed Block MDP to separate approximate decodability from belief contraction, the authors identify a trade-off: distillation can rival RL when the latent dynamics are deterministic, but standard RL with frame-stacking outperforms distillation as stochasticity grows. They also propose smoothing the latent expert to make distillation more robust, and they give theoretical and empirical results, including simulated locomotion experiments, showing when each approach excels. The findings offer practical guidelines for using privileged information to improve policy learning in partially observable domains.
Abstract
Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used privileged expert distillation--which leverages availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy--to disentangle the task of "learning to see" from "learning to act". While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper--through a simple but instructive theoretical model called the perturbed Block MDP, and controlled experiments on challenging simulated locomotion tasks--we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. Our main findings are: (1) The trade-off empirically hinges on the stochasticity of the latent dynamics, as theoretically predicted by contrasting approximate decodability with belief contraction in the perturbed Block MDP; and (2) The optimal latent policy is not always the best latent policy to distill. Our results suggest new guidelines for effectively exploiting privileged information, potentially advancing the efficiency of policy learning across many practical partially observable domains.
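The history-dependent policies discussed above are often approximated in practice by frame-stacking, where the policy input is the concatenation of the last k observations. The following is a minimal sketch of that idea only, not the authors' implementation: the class name `FrameStacker`, the choice to pad the history with copies of the first observation, and the use of plain Python lists are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence


class FrameStacker:
    """Hypothetical helper: keeps the last k observations as one flat vector."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # A bounded deque drops the oldest frame automatically on append.
        self.frames: Deque[List[float]] = deque(maxlen=k)

    def reset(self, first_obs: Sequence[float]) -> List[float]:
        # Assumption: pad the initial history with copies of the first observation.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(list(first_obs))
        return self.stacked()

    def step(self, obs: Sequence[float]) -> List[float]:
        # Append the newest observation; the oldest one falls out of the window.
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self) -> List[float]:
        # Concatenate frames from oldest to newest into one policy input.
        return [value for frame in self.frames for value in frame]


stacker = FrameStacker(k=3)
x0 = stacker.reset([1.0, 2.0])   # history is three copies of the first frame
x1 = stacker.step([3.0, 4.0])    # newest frame is appended at the end
```

A policy trained on `x1` sees a short window of history rather than a single observation. A distilled policy would be trained on the same kind of input, but to imitate an expert that had access to the latent state.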
