PrivilegedDreamer: Explicit Imagination of Privileged Information for Rapid Adaptation of Learned Policies
Morgan Byrd, Jackson Crandell, Mili Das, Jessica Inman, Robert Wright, Sehoon Ha
TL;DR
PrivilegedDreamer tackles HIP-MDPs by adding an external LSTM-based estimator that explicitly predicts hidden parameters and by conditioning the world model, actor, and critic on these estimates, extending DreamerV2 with a dual recurrent architecture. The method estimates parameters rapidly online and outperforms model-based, model-free, and domain-adaptation baselines in both sample efficiency and final reward across five HIP-MDP tasks, including tasks whose rewards are parameterized by hidden variables. Ablation studies validate the necessity of the estimator, the additional prediction head, and the HIP-conditioned networks. This work advances rapid adaptation in model-based RL and points to extensions in vision-based control and multi-agent HIP settings.
Abstract
Numerous real-world control problems, ranging from autonomous driving to robotic manipulation, involve dynamics and objectives affected by unobservable hidden parameters, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hidden-parameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize the transition and reward functions. Existing approaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden parameters as additional variance and often struggle to handle HIP-MDP problems effectively, especially when the rewards are parameterized by hidden variables. We introduce PrivilegedDreamer, a model-based reinforcement learning framework that extends an existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features a novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and domain adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.
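
To make the HIP-MDP setting concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical 1-D environment in which a single hidden scalar theta scales the effect of each action and also sets the reward target (3 * theta), so both the transition and the reward are parameterized by the hidden variable. A running least-squares estimator stands in for the learned LSTM-based estimator, and a hand-written controller stands in for the actor conditioned on the estimate. All names (ToyHipMdp, HistoryEstimator, conditioned_policy, run_episode) are illustrative assumptions.

```python
import random


class ToyHipMdp:
    """Hypothetical 1-D task: hidden theta drives both dynamics and reward."""

    def __init__(self, theta, noise=0.01, seed=0):
        self.theta = theta            # hidden parameter, never shown to the agent
        self.noise = noise
        self.rng = random.Random(seed)
        self.state = 0.0

    def step(self, action):
        # Transition depends on theta: the action is scaled by the hidden gain.
        self.state += self.theta * action + self.rng.gauss(0.0, self.noise)
        # Reward depends on theta: the target position is 3 * theta.
        reward = -(self.state - 3.0 * self.theta) ** 2
        return self.state, reward


class HistoryEstimator:
    """Running least-squares estimate of theta from (action, state change) pairs.

    This is a stand-in for a learned recurrent estimator, not an LSTM.
    """

    def __init__(self, prior=1.0):
        self.prior = prior
        self.sum_aa = 0.0
        self.sum_ad = 0.0

    def update(self, action, delta):
        self.sum_aa += action * action
        self.sum_ad += action * delta

    def estimate(self):
        # Fall back to the prior until some informative action has been seen.
        if self.sum_aa < 1e-8:
            return self.prior
        return self.sum_ad / self.sum_aa


def conditioned_policy(state, theta_hat):
    """Controller conditioned on the estimate: steer toward 3 * theta_hat."""
    gain = theta_hat if abs(theta_hat) > 1e-3 else 1e-3
    action = (3.0 * theta_hat - state) / gain
    return max(-1.0, min(1.0, action))  # actions are clipped to [-1, 1]


def run_episode(theta, steps=20, seed=0):
    env = ToyHipMdp(theta, seed=seed)
    estimator = HistoryEstimator()
    state, rewards = env.state, []
    for _ in range(steps):
        theta_hat = estimator.estimate()
        action = conditioned_policy(state, theta_hat)
        next_state, reward = env.step(action)
        estimator.update(action, next_state - state)
        state = next_state
        rewards.append(reward)
    return estimator.estimate(), rewards


if __name__ == "__main__":
    for true_theta in (0.5, 2.0):
        theta_hat, rewards = run_episode(true_theta)
        print(true_theta, round(theta_hat, 3), round(rewards[-1], 4))
```

In this sketch the estimate is refined from the first few transitions of each episode, and the controller uses it both to invert the dynamics and to locate the reward target. A policy that treated theta only as unmodeled variance would have no way to infer where the target is, which illustrates why parameterized rewards are the harder case noted above.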
