Dual policy as self-model for planning

Jaesung Yoo; Fernanda de la Torre; Guangyu Robert Yang

TL;DR

Results on an ecologically relevant, parametric environment indicate that using a distilled policy network as the self-model stabilizes training, gives faster inference than using the model-free policy, promotes better exploration, and could learn a comprehensive understanding of the agent's own behaviors, at the cost of distilling a new network separate from the model-free policy.

Abstract

Planning is a data-efficient decision-making strategy in which an agent selects candidate actions by exploring possible future states. To simulate future states when the action space is high-dimensional, knowledge of one's own decision-making strategy must be used to limit the number of actions explored. We refer to the model used to simulate one's decisions as the agent's self-model. While self-models are widely, if implicitly, used in conjunction with world models to plan actions, it remains unclear how self-models should be designed. Inspired by current reinforcement learning approaches and neuroscience, we explore the benefits and limitations of using a distilled policy network as the self-model. In such dual-policy agents, a model-free policy and a distilled policy are used for model-free actions and planned actions, respectively. Our results on an ecologically relevant, parametric environment indicate that using a distilled policy network as the self-model stabilizes training, gives faster inference than using the model-free policy, promotes better exploration, and could learn a comprehensive understanding of the agent's own behaviors, at the cost of distilling a new network separate from the model-free policy.
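The dual-policy idea can be illustrated with a minimal sketch. This is not the authors' implementation: the network shapes, the learning rate, and the names `teacher_policy`, `student_policy`, `distill`, and `candidate_actions` are all hypothetical. It assumes the model-free policy is a fixed small network, distills it into a smaller linear policy by minimizing the KL divergence between the two action distributions, and then uses the distilled policy to shortlist the actions a planner would expand.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_policy(state, w_hidden, w_out):
    """Hypothetical model-free policy: a fixed two-layer network."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in w_hidden]
    return softmax([sum(w * h for w, h in zip(row, hidden)) for row in w_out])


def student_policy(state, weights):
    """Hypothetical distilled policy: one linear layer followed by softmax."""
    return softmax([sum(w * s for w, s in zip(row, state)) for row in weights])


def mean_kl(states, targets, weights):
    """Average KL(teacher || student) over the sampled states."""
    total = 0.0
    for s, p in zip(states, targets):
        q = student_policy(s, weights)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(states)


def distill(states, targets, n_actions, lr=0.5, steps=200):
    """Fit the student to the teacher's action distributions by gradient descent."""
    dim = len(states[0])
    weights = [[0.0] * dim for _ in range(n_actions)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n_actions)]
        for s, p in zip(states, targets):
            q = student_policy(s, weights)
            for a in range(n_actions):
                for j in range(dim):
                    # Gradient of the cross-entropy with respect to the logits is q - p.
                    grad[a][j] += (q[a] - p[a]) * s[j]
        for a in range(n_actions):
            for j in range(dim):
                weights[a][j] -= lr * grad[a][j] / len(states)
    return weights


def candidate_actions(state, weights, k):
    """Self-model query: the k actions the distilled policy rates most likely."""
    probs = student_policy(state, weights)
    return sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)[:k]


# Illustrative sizes only; none of these values come from the paper.
rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 6
w_hidden = [[rng.gauss(0, 1) for _ in range(STATE_DIM)] for _ in range(HIDDEN)]
w_out = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(N_ACTIONS)]
states = [[rng.gauss(0, 1) for _ in range(STATE_DIM)] for _ in range(128)]
targets = [teacher_policy(s, w_hidden, w_out) for s in states]

kl_before = mean_kl(states, targets, [[0.0] * STATE_DIM for _ in range(N_ACTIONS)])
student = distill(states, targets, N_ACTIONS)
kl_after = mean_kl(states, targets, student)
top_actions = candidate_actions(states[0], student, k=2)
```

In this sketch the planner would only simulate the actions in `top_actions` rather than the full action set, and it queries the smaller student network rather than the model-free policy when doing so.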

Paper Structure (11 sections, 4 equations, 6 figures, 7 tables)

Introduction
Hypothesized pros and cons of different self-model implementations
Experiment setup
Environment
Networks
Results
Dual policy agent achieves similar performance to shared policy agent with better stability
Dual policy agent has faster planning inference with smaller distilled networks
Dual policy enhances exploration
Dual policy learns comprehensive understanding of the agent's behaviors
Discussion

Figures (6)

Figure 1: Concept of self model and its two distinct designs
Figure 2: Environment reward settings
Figure 3: Proportions of episode outcomes during training episodes. Simple, shared policy, and dual policy agents are compared. The shaded area is the standard deviation.
Figure 4: Proportions of rewards during evaluation episodes across different model-free and distilled policy sizes. Error bars indicate 95% confidence intervals.
Figure 5: Proportions of reward types during training episodes for different map sizes. (a)-(c): Map size of 10 units, (d)-(f): Map size of 20 units, (g)-(i): Map size of 30 units. Simple, shared policy, and dual policy agents are compared. The shaded area is the 95% confidence interval.
...and 1 more figures
