Task Aware Dreamer for Task Generalization in Reinforcement Learning
Chengyang Ying, Xinning Zhou, Zhongkai Hao, Hang Su, Songming Liu, Dong Yan, Jun Zhu
TL;DR
The paper tackles generalization across task distributions in reinforcement learning where the dynamics are shared but the rewards differ. It introduces Task Aware Dreamer (TAD), a framework that learns reward-informed world models by optimizing a task-aware ELBO with an added task-context term, so that a history-encoding policy class $\Pi_3$ can distinguish tasks and generalize better. The authors define Task Distribution Relevance ($D_{\text{TDR}}$) to quantify how strongly the tasks in a distribution differ, and prove that standard Markovian or history-based policies ($\Pi_1$, $\Pi_2$) are sub-optimal under high $D_{\text{TDR}}$, which justifies the use of $\Pi_3$. They provide two training variants for the task term, cross-entropy and supervised-contrastive, and validate TAD on image-based and state-based domains, showing superior performance on unseen tasks and robustness to changes in dynamics. The work advances multi-task generalization by combining theoretical analysis with task-aware world models that capture structure shared across tasks and support zero-shot transfer.
Abstract
A long-standing goal of reinforcement learning is to obtain agents that learn on training tasks and generalize well to unseen tasks that share similar dynamics but have different reward functions. The ability to generalize across tasks is important because it determines an agent's adaptability to real-world scenarios where reward mechanisms may vary. In this work, we first show that training a general world model can exploit the structure shared by these tasks and help train more generalizable agents. Extending world models to the task generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify consistent latent characteristics across tasks. Within TAD, we derive a variational lower bound on the log-likelihood of the sampled data and use it as the optimization objective of our reward-informed world models; this bound introduces a new term designed to differentiate tasks using their states. To demonstrate the advantages of the reward-informed policy in TAD, we introduce a new metric called Task Distribution Relevance (TDR), which quantitatively measures the relevance of different tasks. For task distributions with a high TDR, i.e., where the tasks differ significantly, we show that Markovian policies struggle to distinguish the tasks, which makes the reward-informed policies in TAD necessary. Extensive experiments on both image-based and state-based tasks show that TAD significantly improves performance when handling different tasks simultaneously, especially for those with high TDR, and displays strong generalization to unseen tasks.
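
As a rough illustration of a term that differentiates tasks from states, the sketch below computes a mean cross-entropy loss between task logits and true task indices (cross-entropy is one of the two variants named in the TL;DR). It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `task_cross_entropy`, the idea of a task-prediction head producing `logits`, and the list-based inputs are all hypothetical.

```python
import math


def task_cross_entropy(logits, task_ids):
    """Mean cross-entropy between per-sample task logits and task labels.

    logits:   list of lists, one row of unnormalized scores per sample
              (assumed output of a hypothetical task-prediction head).
    task_ids: list of integer task indices, one per sample.
    """
    total = 0.0
    for row, label in zip(logits, task_ids):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[label]  # equals -log softmax(row)[label]
    return total / len(task_ids)


# Uninformative logits over 4 tasks give a loss of log(4);
# logits that favor the correct task give a smaller loss.
uniform_loss = task_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
confident_loss = task_cross_entropy([[0.0, 0.0, 5.0, 0.0]], [2])
```

Under these assumptions, minimizing such a loss pushes the features that feed the task head to carry information about which task generated the data, which is the role the abstract assigns to the new term.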
