Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

Guillermo Infante, David Kuric, Anders Jonsson, Vicenç Gómez, Herke van Hoof

TL;DR

The paper addresses the challenge of solving multiple tasks with non-Markovian rewards by learning a policy basis of subpolicies through successor features (SFs). It couples a low-level SF-based MDP family with high-level finite state automata (FSAs) to form a product MDP and plans over augmented exit states using SF-FSA-VI, guaranteeing global optimality in stochastic settings provided the learned convex coverage set (CCS) is optimal. Key contributions include the SF-FSA-VI algorithm, a theoretical optimality guarantee via generalized policy improvement (GPI) on a CCS, and empirical validation in the Delivery and Office domains showing faster learning and planning than baselines. The approach provides zero-shot generalization to new FSA-described tasks while remaining interpretable through explicit policy composition and planning over exit states, and it may extend to continuous domains.

Abstract

Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.

Paper Structure

This paper contains 23 sections, 1 theorem, 16 equations, 6 figures, 4 algorithms.

Key Result

Theorem 1

Let $\Pi$ be a set of policies such that the set of their expected SFs, $\Psi=\{\boldsymbol{\psi}^\pi\}_{\pi\in\Pi}$, constitutes a CCS. Then, given any weight vector $\mathbf{w}\in\mathbb{R}^d$, the GPI policy $\pi_\mathbf{w}^{GPI}(s) \in \arg \max_{a\in A} \max_{\pi\in\Pi} Q_\mathbf{w}^\pi(s,a)$ i
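
The GPI action selection in the theorem statement can be illustrated with a minimal sketch. It assumes the standard successor-feature relation $Q_\mathbf{w}^\pi(s,a) = \boldsymbol{\psi}^\pi(s,a)^\top \mathbf{w}$ (rewards linear in the features); the function names, the nested-list layout, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def q_value(psi_sa: Vector, w: Vector) -> float:
    """Q_w^pi(s, a) as the dot product psi^pi(s, a) . w (assumed linear rewards)."""
    return sum(p * x for p, x in zip(psi_sa, w))


def gpi_action(psi_s: Sequence[Sequence[Vector]], w: Vector) -> int:
    """Return argmax_a max_pi Q_w^pi(s, a) for one fixed state s.

    psi_s[i][a] holds the SF vector of policy i for action a in that state
    (a hypothetical layout chosen for this sketch).
    """
    n_actions = len(psi_s[0])
    # For each action, keep the best value any policy in the basis achieves.
    best_per_action = [
        max(q_value(policy_sfs[a], w) for policy_sfs in psi_s)
        for a in range(n_actions)
    ]
    # Pick the action whose best-case value is largest.
    return max(range(n_actions), key=best_per_action.__getitem__)


# Toy state with two policies, two actions, and d = 2 features (made-up values).
psi_s = [
    [[1.0, 0.0], [0.5, 0.5]],  # policy 0: SFs for actions 0 and 1
    [[0.0, 0.2], [0.0, 1.0]],  # policy 1: SFs for actions 0 and 1
]

action_for_first_feature = gpi_action(psi_s, [1.0, 0.0])
action_for_second_feature = gpi_action(psi_s, [0.0, 1.0])
```

Changing only the weight vector changes the selected action without relearning any SFs, which is the mechanism the zero-shot claim relies on.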

Figures (6)

  • Figure 1: Depiction of the Office (a) and Delivery (b) environments, the FSA task specification of the composite task in the Office domain, and the FSA task specification of the sequential task in the Delivery domain (b). In (a), $\mathcal{P}=\{\text{\Coffeecup}, \text{\Letter}, o\}$ and $\mathcal{E}=\{\text{\Coffeecup}^1,\text{\Coffeecup}^2, \text{\Letter}^1,\text{\Letter}^2, o^1, o^2\}$. In (b), $\mathcal{E}=\mathcal{P}=\{A, B, C, H\}$.
  • Figure 2: Disjunction (a) and composite (b) FSA task specifications for the Office domain.
  • Figure 3: Experimental results for learning (Delivery, top-left and Office, bottom-left) and compositionality (Delivery, top-right and Office, bottom-right). Results show the average performance and standard deviation over the three tasks and 5 seeds per task.
  • Figure 4: Double Slit environment (left) and FSA task specification to reach either the blue or the red goal location.
  • Figure 5: Finite state automata for the sequential (a), disjunction (b), and composite (c) tasks in the Office domain.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1: Alegre, Bazzan, and Silva, 2022