Learning Purposeful Behaviour in the Absence of Rewards

Marlos C. Machado, Michael Bowling

TL;DR

This paper tackles reward-sparse reinforcement learning by enabling purposeful behaviour through automatic option discovery. It introduces POD, which builds temporally extended actions in three steps: it identifies intrinsic purposes via singular value decomposition of observed state-feature changes, learns policies that maximize these purposes, and converts those policies into usable options. The approach yields eigenbehaviours that guide the agent to diverse parts of the state space, improving exploration without relying on external rewards. Results in a reward-free ring-world domain demonstrate growing option complexity and enhanced state-space coverage, with robustness to partial observability and potential applicability to larger, function-approximation settings.
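To make the first step above concrete, here is a minimal sketch of how candidate purposes could be extracted from observed state-feature changes with an SVD. It is an illustration under stated assumptions, not the paper's implementation: the toy ring with one-hot features, the helper names (`ring_world_transitions`, `candidate_purposes`, `intrinsic_reward`), the use of right singular vectors as candidate purposes, and the projection-style intrinsic reward are all assumptions made for this example.

```python
import numpy as np


def ring_world_transitions(n_states, n_steps, rng):
    """Collect feature changes from a random walk on a ring.

    Assumption: one-hot state features on a toy ring, used only as a
    stand-in for whatever features an agent actually observes.
    """
    features = np.eye(n_states)
    state = 0
    diffs = []
    for _ in range(n_steps):
        step = rng.choice([-1, 1])              # move left or right
        next_state = (state + step) % n_states  # wrap around the ring
        diffs.append(features[next_state] - features[state])
        state = next_state
    return np.array(diffs)


def candidate_purposes(diffs):
    """Return singular values and right singular vectors of the change matrix.

    Assumption: each row of `vt` is treated as one candidate purpose,
    i.e. a direction in feature space along which changes were observed.
    """
    _, singular_values, vt = np.linalg.svd(diffs, full_matrices=False)
    return singular_values, vt


def intrinsic_reward(purpose, phi_s, phi_next):
    """Hypothetical intrinsic reward: feature change projected on a purpose."""
    return float(purpose @ (phi_next - phi_s))


rng = np.random.default_rng(0)
diffs = ring_world_transitions(n_states=8, n_steps=200, rng=rng)
singular_values, purposes = candidate_purposes(diffs)
```

A policy trained to maximize `intrinsic_reward` for one row of `purposes` would be the kind of behaviour the summary calls an eigenbehaviour; how such a policy is turned into an option, including where it terminates, is not covered by this sketch.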

Abstract

Artificial intelligence is commonly defined as the ability to achieve goals in the world. In the reinforcement learning framework, goals are encoded as reward functions that guide agent behaviour, and the sum of observed rewards provides a notion of progress. However, some domains have no such reward signal, or have a reward signal so sparse as to appear absent. Without reward feedback, agent behaviour is typically random, often dithering aimlessly and lacking intentionality. In this paper we present an algorithm capable of learning purposeful behaviour in the absence of rewards. The algorithm proceeds by constructing temporally extended actions (options) through the identification of purposes that are "just out of reach" of the agent's current behaviour. These purposes establish intrinsic goals for the agent to learn, ultimately resulting in a suite of behaviours that encourage the agent to visit different parts of the state space. Moreover, the approach is particularly suited for settings where rewards are very sparse, and such behaviours can help in the exploration of the environment until reward is observed.

Paper Structure

This paper contains 7 sections, 4 theorems, 5 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Consider an option $o = \langle \mathcal{I}_o, \pi_o, \mathcal{T}_o \rangle$ discovered with Algorithm 1 where $\gamma < 1$. Then $\mathcal{T}_o$ is nonempty.

Figures (1)

  • Figure 1: Sample random walk using primitive actions and a random walk using the discovered options. Dashed vertical lines represent iteration boundaries.

Theorems & Definitions (10)

  • Definition 3.1: Eigenpurpose
  • Definition 3.2: Eigenbehaviour
  • Theorem 3.1: Option's Termination
  • proof: Proof intuition
  • Lemma 5.1
  • proof
  • Lemma 5.2
  • proof
  • Theorem 5.1: Option's Termination
  • proof