Task Aware Dreamer for Task Generalization in Reinforcement Learning

Chengyang Ying, Xinning Zhou, Zhongkai Hao, Hang Su, Songming Liu, Dong Yan, Jun Zhu

TL;DR

The paper tackles generalization across task distributions in reinforcement learning where the dynamics are shared but the rewards differ. It introduces Task Aware Dreamer (TAD), a framework that trains reward-informed world models with a task-aware ELBO containing a task-context term, enabling a history-encoding policy class $\Pi_3$ to distinguish tasks and improve generalization. The authors define Task Distribution Relevance ($D_{\text{TDR}}$) to quantify how strongly the tasks in a distribution differ and prove that standard Markovian or history-based policies ($\Pi_1$, $\Pi_2$) are sub-optimal under high $D_{\text{TDR}}$, justifying the use of $\Pi_3$. They provide two training variants for the task term (cross-entropy and supervised-contrastive) and validate TAD across image-based and state-based domains, showing superior performance on unseen tasks and robustness to changes in dynamics. The work advances multi-task generalization by combining theoretical insights with task-aware world models that better capture invariant structures across tasks and support zero-shot transfer.

Abstract

A long-standing goal of reinforcement learning is to acquire agents that can learn on training tasks and generalize well on unseen tasks that may share a similar dynamic but with different reward functions. The ability to generalize across tasks is important as it determines an agent's adaptability to real-world scenarios where reward mechanisms might vary. In this work, we first show that training a general world model can utilize similar structures in these tasks and help train more generalizable agents. Extending world models into the task generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify consistent latent characteristics across tasks. Within TAD, we compute the variational lower bound of sample data log-likelihood, which introduces a new term designed to differentiate tasks using their states, as the optimization objective of our reward-informed world models. To demonstrate the advantages of the reward-informed policy in TAD, we introduce a new metric called Task Distribution Relevance (TDR) which quantitatively measures the relevance of different tasks. For tasks exhibiting a high TDR, i.e., the tasks differ significantly, we illustrate that Markovian policies struggle to distinguish them, thus it is necessary to utilize reward-informed policies in TAD. Extensive experiments in both image-based and state-based tasks show that TAD can significantly improve the performance of handling different tasks simultaneously, especially for those with high TDR, and display a strong generalization ability to unseen tasks.
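
The TL;DR names two training variants for the task term, cross-entropy and supervised-contrastive. The sketch below illustrates what these two generic losses compute on a small batch of embeddings labeled by task. It is not the authors' implementation: the function names, the `temperature` default, and the use of plain Python lists in place of world-model latent states are all assumptions made for illustration, and the supervised-contrastive form follows the commonly used batch formulation rather than anything confirmed by the source.

```python
import math


def cross_entropy_task_loss(logits, task_id):
    """Negative log-probability of the true task under softmax(logits).

    `logits` holds one score per training task (hypothetical classifier head).
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[task_id]


def _normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sup_contrastive_task_loss(embeddings, task_ids, temperature=0.1):
    """Supervised contrastive loss over a batch of task-labeled embeddings.

    Embeddings from the same task are pulled together and embeddings from
    different tasks are pushed apart. Anchors without a positive are skipped.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and task_ids[j] == task_ids[i]]
        if not positives:
            continue
        # Scaled cosine similarity of anchor i to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        m = max(sims.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-softmax over the anchor's positives.
        total += -sum(sims[j] - log_z for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cross_entropy_task_loss([2.0, 0.0], 0))
    print(sup_contrastive_task_loss(batch, [0, 0, 1, 1]))
```

Under these assumptions, the cross-entropy variant needs explicit task labels and a classifier head over the training tasks, while the contrastive variant only needs to know which samples share a task, so it shapes the embedding space directly.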

Paper Structure (46 sections, 6 theorems, 33 equations, 6 figures, 7 tables, 1 algorithm)

This paper contains 46 sections, 6 theorems, 33 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{Q}$ be the space of observation-action Q functions. Given $M$ tasks $\{\mathcal{M}_m\}_{m=1}^M$ and the corresponding datasets $\mathcal{D}_m=\{(o_t^m,a_t^m, r_t^m, o_{t+1}^m)\}$, we set the product space $\mathcal{H} = \mathcal{Q}^M$ composed of $M$ spaces, i.e., $\forall \{q_m\}_{m=1}^M \in$ ... Then we have $\mathcal{H}_3\subseteq \mathcal{H}_2\subseteq \mathcal{H}_1$.

Figures (6)

  • Figure 1: An overview. Given a task distribution, we train the agent on training tasks and expect it to generalize zero-shot to test tasks. To improve generalization, we propose TAD, which uses $\Pi_3$ to encode all historical information for inferring the current task, together with novel reward-informed world models for capturing invariant latent features.
  • Figure 2: Probabilistic graphical model designs for the single-task setting (left) and the task-distribution setting (right). The latter inspires the design of reward-informed world models. Solid and dashed lines represent the generative process and the inference model, respectively.
  • Figure 3: Sampled trajectories and imaginary trajectories of TAD for different tasks (Cheetah-run and Cheetah-flip).
  • Figure 4: The t-SNE clustering of state embeddings for different tasks sampled via Dreamer, TAD-CE, and TAD-SC.
  • Figure 5: Ablation study on Reward Signal.
  • ...and 1 more figure

Theorems & Definitions (12)

  • Theorem 1: Proof in Appendix A.1
  • Theorem 2: Sub-Optimality of $\Pi_1,\Pi_2$. Proof in Appendix A.3
  • Definition 1: Task Distribution Relevance
  • Theorem 3: Proof in Appendix A.4
  • Theorem 4: Informal statement. Detailed analysis and proof in Appendix A.6
  • proof
  • proof
  • proof
  • Proposition 1
  • proof
  • ...and 2 more