Dual policy as self-model for planning

Jaesung Yoo, Fernanda de la Torre, Guangyu Robert Yang

TL;DR

Results on an ecologically relevant, parametric environment indicate that using a distilled policy network as the self-model stabilizes training, gives faster inference than using the model-free policy, promotes better exploration, and could learn a comprehensive understanding of the agent's own behaviors, at the cost of distilling a new network apart from the model-free policy.

Abstract

Planning is a data-efficient decision-making strategy in which an agent selects candidate actions by exploring possible future states. To simulate future states when the action space is high-dimensional, knowledge of one's own decision-making strategy must be used to limit the number of actions explored. We refer to the model used to simulate one's decisions as the agent's self-model. While self-models are widely and implicitly used in conjunction with world models to plan actions, it remains unclear how self-models should be designed. Inspired by current reinforcement learning approaches and neuroscience, we explore the benefits and limitations of using a distilled policy network as the self-model. In such dual-policy agents, a model-free policy and a distilled policy are used for model-free actions and planned actions, respectively. Our results on an ecologically relevant, parametric environment indicate that using a distilled policy network as the self-model stabilizes training, gives faster inference than using the model-free policy, promotes better exploration, and could learn a comprehensive understanding of the agent's own behaviors, at the cost of distilling a new network apart from the model-free policy.
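
The dual-policy idea described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract gives no architectural or training details, so everything below is an assumption made for illustration, including the one-dimensional task, the names `model_free_policy`, `distill`, `world_model`, `value`, and `plan`, the cross-entropy distillation rule, and the one-step lookahead. The sketch shows only the general pattern: a student policy is distilled from a model-free policy, and the planner uses that student as a self-model to restrict which actions it simulates.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy task: positions on a line, five actions that move by -2..+2.
MOVES = [-2, -1, 0, 1, 2]
GOAL = 10


def model_free_policy(state):
    """Stand-in for the model-free policy: fixed, state-independent preferences."""
    return softmax([0.0, 0.5, 1.0, 2.0, 3.0])


def distill(teacher, states, steps=2000, lr=0.5):
    """Fit state-independent student logits to the teacher by cross-entropy descent."""
    logits = [0.0] * len(MOVES)
    for step in range(steps):
        target = teacher(states[step % len(states)])
        probs = softmax(logits)
        # Gradient of the cross-entropy w.r.t. the logits is (student - teacher).
        logits = [z - lr * (p - q) for z, p, q in zip(logits, probs, target)]
    return lambda state: softmax(logits)


def world_model(state, action):
    """Assumed known, deterministic dynamics for the toy task."""
    return state + MOVES[action]


def value(state):
    """Assumed value estimate: negative distance to the goal."""
    return -abs(state - GOAL)


def plan(state, self_model, num_candidates=3):
    """One-step lookahead restricted to the self-model's most likely actions."""
    probs = self_model(state)
    ranked = sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)
    candidates = ranked[:num_candidates]
    return max(candidates, key=lambda a: value(world_model(state, a)))


# The distilled policy acts as the self-model; the teacher is never queried while planning.
self_model = distill(model_free_policy, states=[0, 5, 9])
far_action = plan(0, self_model)
near_action = plan(9, self_model)
print(MOVES[far_action], MOVES[near_action])
```

In this toy setup the self-model narrows five actions to three candidates, and the planner then picks among them using simulated outcomes. At state 9 it selects the +1 move even though the self-model rates +2 as more likely, which is the sense in which planned actions can differ from model-free ones.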

Paper Structure (11 sections, 4 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Concept of self model and its two distinct designs
  • Figure 2: Environment reward settings
  • Figure 3: Proportions of episode outcomes during training episodes. Simple, shared-policy, and dual-policy agents are compared. The shaded area shows the standard deviation.
  • Figure 4: Proportions of rewards during evaluation episodes across different model-free and distilled policy sizes. Error bars indicate 95% confidence intervals.
  • Figure 5: Proportions of reward types during training episodes for different map sizes. (a)-(c): map size of 10 units; (d)-(f): map size of 20 units; (g)-(i): map size of 30 units. Simple, shared-policy, and dual-policy agents are compared. The shaded area shows the 95% confidence interval.
  • ...and 1 more figure