An advantage based policy transfer algorithm for reinforcement learning with measures of transferability

Md Ferdous Alam; Parinaz Naghizadeh; David Hoelzle

TL;DR

The paper tackles sample-inefficient transfer learning in reinforcement learning for fixed-domain environments by introducing APT-RL, an off-policy method built on Soft Actor-Critic that uses advantage-based regularization to weight source knowledge. It eliminates heuristic hyperparameters through an adaptive temperature $\beta_t = e^{A^t_S - A^t_T}$ and enables synchronous updates of the source policy with target data, improving data efficiency. In addition, it defines a relative transfer performance metric and develops a model-based task similarity algorithm to predict transferability and align transfer performance with task similarity. Experiments on HalfCheetah-v3, Ant-v3, and Humanoid-v3 show that APT-RL often outperforms baselines and, in adversarial settings, learns at least as well as learning from scratch, highlighting practical gains in high-dimensional control tasks.
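As a rough illustration of the adaptive coefficient described above, the sketch below evaluates $e^{A^t_S - A^t_T}$ from two scalar advantage estimates. The function names `batch_mean_advantage` and `adaptive_beta`, the use of a batch mean as the advantage estimate, and the clipping of the exponent are all assumptions made for this example; the summary does not specify how APT-RL estimates or stabilizes these quantities.

```python
import math


def batch_mean_advantage(q_values, v_values):
    """Mean of A = Q - V over a batch (an assumed estimator, not from the paper)."""
    return sum(q - v for q, v in zip(q_values, v_values)) / len(q_values)


def adaptive_beta(adv_source, adv_target, clip=20.0):
    """Return exp(A_S - A_T).

    The exponent is clipped to [-clip, clip] purely for numerical safety;
    the clipping is an assumption of this sketch.
    """
    diff = max(-clip, min(clip, adv_source - adv_target))
    return math.exp(diff)


# Hypothetical batch values: the source advantage is higher than the target's,
# so the coefficient is above 1 and source knowledge is weighted more heavily.
a_s = batch_mean_advantage([1.0, 2.0], [0.5, 1.5])
a_t = batch_mean_advantage([1.0, 2.0], [1.0, 2.0])
beta = adaptive_beta(a_s, a_t)
```

Under this reading, the coefficient equals 1 when both advantages agree, grows when the source advantage dominates, and shrinks toward 0 when the target advantage dominates, so no hand-tuned weight is needed.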

Abstract

Reinforcement learning (RL) enables sequential decision-making in complex, high-dimensional environments through interaction with the environment. In most real-world applications, however, a large number of interactions is infeasible. In these settings, transfer RL algorithms, which transfer knowledge from one or more source environments to a target environment, have been shown to increase learning speed and improve initial and asymptotic performance. However, most existing transfer RL algorithms are on-policy and sample inefficient, fail in adversarial target tasks, and often require heuristic choices in algorithm design. This paper proposes an off-policy Advantage-based Policy Transfer algorithm, APT-RL, for fixed-domain environments. Its novelty lies in using the popular notion of "advantage" as a regularizer to weigh the knowledge that should be transferred from the source against new knowledge learned in the target, removing the need for heuristic choices. Further, we propose a new transfer performance measure to evaluate the performance of our algorithm and unify existing transfer RL frameworks. Finally, we present a scalable, theoretically backed task similarity measurement algorithm to illustrate the alignment between our proposed transferability measure and the similarity between source and target environments. We compare APT-RL with several baselines, including existing transfer RL algorithms, in three high-dimensional continuous control tasks. Our experiments demonstrate that APT-RL outperforms existing transfer RL algorithms and is at least as good as learning from scratch in adversarial tasks.

Paper Structure (27 sections, 4 theorems, 25 equations, 6 figures, 3 tables, 3 algorithms)

Introduction
Related work
APT-RL: An off-policy advantage based policy transfer algorithm
Advantage-based policy regularization
Synchronous update of the source policy
An evaluation framework for transfer RL
A measure of transferability
Theoretical support
Revisiting the toy problem
Measuring task similarity
Theoretical motivation
A model-based task similarity measurement algorithm
Experiment Setup
Experiment Results
Task similarity
...and 12 more sections

Key Result

Theorem 1

(Relative transfer performance and policy improvement) Consider $\rho_t^i = \mathbb{E}^{\pi_{i, t}}\left[\sum_{k=0}^H r_k|\mathbf{s}_0\right]$ for policy $\pi_i$ and $\rho_t^b = \mathbb{E}^{\pi_{b, t}}\left[\sum_{k=0}^H r_k|\mathbf{s}_0\right]$ for policy $\pi_b$, where $\mathbf{s}_0$ is the startin

Figures (6)

Figure 1: Knowledge transfer in the four-room toy problem: three tasks are presented in (a), (b) and (c), where $\bullet$ represents the starting state and $\star$ represents the goal state; the goal state is moved farther away in (b) and (c) compared to (a), and the doorways are also changed slightly. This makes the target task $\mathcal{T}_1$ in (b) more similar to the source task $\mathcal{S}$ in (a) than the other target task $\mathcal{T}_2$ in (c) is. (d) An evaluation measure, $\tau$, for transfer learning performance in tasks $\mathcal{T}_1$ and $\mathcal{T}_2$, calculated at each evaluation episode; (e)-(f) influence of the source policy on the target tasks $\mathcal{T}_1$ and $\mathcal{T}_2$, shown in terms of $e^{A(s, a)}$, where $A(s, a) = Q^*_{\mathcal{T}_i}(s, a) - V^*_{\mathcal{T}_i}(s)$ is the advantage function in task $\mathcal{T}_i$ for $i = 1, 2$. Note that the action is selected according to the source policy to calculate the advantage, which demonstrates the effect of the source policy on the target.
Figure 2: Task dissimilarity: Empirical task similarity between several variations of the HalfCheetah, Ant, and Humanoid environments.
Figure 3: APT-RL transferability, $\Lambda_\text{APT-RL}$: APT-RL is compared against vanilla SAC (learning from scratch), REPAINT, zero-shot policy, and fine-tuned policy. Average return during the evaluation episode is taken as $\rho_t$, meaning $\rho_t = \mathbb{E}^{\pi^*_{\mathcal{T}_i}}[\sum_{k}r_k]$. We do not show REPAINT for the Humanoid environment as it fails to solve the tasks. Results are shown with a range of one standard deviation.
Figure 4: Left: Relative transfer performance, $\tau_t$, is shown with corresponding mean similarity scores. Right: Regularization coefficient, $\beta_t$, is shown for all tasks with corresponding mean similarity scores.
Figure 5: Ablation study of the $\beta$ parameter in APT-RL: manual tuning of the hyperparameter $\beta$ is compared against APT-RL in the least similar tasks for all three environments.
...and 1 more figures
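
The quantity plotted in panels (e)-(f) of Figure 1 can be sketched for a small tabular problem as follows. This is only an illustrative reading of the caption: the function name `source_influence`, the dictionary layout, and the example states and actions are hypothetical, and the sketch relies on the standard identity $V^*(s) = \max_a Q^*(s, a)$ rather than on anything stated in the summary.

```python
import math


def source_influence(q_target, source_policy):
    """Compute exp(A(s, a_S)) for every state.

    q_target: dict mapping state -> {action: optimal Q-value in the target task}
    source_policy: dict mapping state -> action chosen by the source policy
    A(s, a) = Q*(s, a) - V*(s), with V*(s) = max_a Q*(s, a).
    """
    influence = {}
    for state, action_values in q_target.items():
        v_star = max(action_values.values())
        source_action = source_policy[state]
        influence[state] = math.exp(action_values[source_action] - v_star)
    return influence


# Hypothetical two-state example: the source action is optimal in "s0"
# but suboptimal by 1.0 in "s1".
q_target = {
    "s0": {"left": 1.0, "right": 2.0},
    "s1": {"left": 0.0, "right": -1.0},
}
source_policy = {"s0": "right", "s1": "right"}
influence = source_influence(q_target, source_policy)
```

Because the optimal advantage is never positive, each value lies in $(0, 1]$ and equals 1 exactly where the source policy's action is also optimal in the target task.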

Theorems & Definitions (8)

Definition 4.1: Single-task transferability
Definition 4.2: Relative transfer performance, $\tau$
Theorem 1
Theorem 2
Theorem 1
proof
Theorem 2
proof