Human-Inspired Framework to Accelerate Reinforcement Learning

Ali Beikmohammadi, Sindri Magnússon

TL;DR

The paper tackles RL sample inefficiency by introducing TA-Explore, a human-inspired curriculum that leverages progressively challenging auxiliary tasks with an annealed assistant reward to accelerate learning of the main objective. It defines a sequence of TA^e MDPs through a convex combination of auxiliary and main rewards and demonstrates that, with a decreasing β(e), the agent transfers knowledge from simpler tasks to the primary task in an algorithm-agnostic way. Empirical results on simple Random Walks and challenging linear/nonlinear control problems show faster convergence and robust performance, with no extra computational cost and the ability to transfer either value or policy across RL methods. Limitations include the need to define suitable auxiliary goals and tune β(e); future work proposes self-tuning β and applying the framework to POMDPs and multi-agent scenarios.
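
The annealed reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation. The summary says only that the combined reward is a convex combination of the auxiliary and main rewards and that β(e) decreases. The geometric schedule `lam ** e`, the choice to put the weight β(e) on the auxiliary reward, and the names `beta` and `blended_reward` are all hypothetical.

```python
def beta(e: int, lam: float = 0.9) -> float:
    """Hypothetical annealing schedule: starts at 1 and decays geometrically.

    Assumption: beta(e) = lam ** e with 0 <= lam < 1. The source only states
    that beta(e) is decreasing and that a larger lambda slows the shift.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    return lam ** e


def blended_reward(r_aux: float, r_main: float, e: int, lam: float = 0.9) -> float:
    """Convex combination of the assistant and main rewards at episode e.

    Assumption: beta(e) weights the assistant reward, so early episodes
    mostly follow the auxiliary goal and later ones mostly the main goal.
    """
    b = beta(e, lam)
    return b * r_aux + (1.0 - b) * r_main
```

For example, `blended_reward(0.0, 1.0, e=5, lam=0.5)` weights the main reward by `1 - 0.5 ** 5` and returns `0.96875`. Because only the reward signal changes, a function like this can wrap the reward of any RL algorithm, which is one way to read the "algorithm-agnostic" claim.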

Abstract

Reinforcement learning (RL) is crucial for data science decision-making but suffers from sample inefficiency, particularly in real-world scenarios with costly physical interactions. This paper introduces a novel human-inspired framework to enhance RL algorithm sample efficiency. It achieves this by initially exposing the learning agent to simpler tasks that progressively increase in complexity, ultimately leading to the main task. This method requires no pre-training and involves learning simpler tasks for just one iteration. The resulting knowledge can facilitate various transfer learning approaches, such as value and policy transfer, without increasing computational complexity. It can be applied across different goals, environments, and RL algorithms, including value-based, policy-based, tabular, and deep RL methods. Experimental evaluations on both a simple Random Walk and more complex optimal control problems with constraints demonstrate the framework's effectiveness in enhancing sample efficiency, especially in challenging main tasks.

Paper Structure (14 sections, 3 theorems, 15 equations, 8 figures)

Key Result

Proposition 1

Let tasks $\texttt{TA}^e$ and $T$ be given by their respective definitions. Since $R^e$ is a convex combination of $R^A$ and $R^T$, and tasks $A$ and $T$ are $\alpha_{A,T}$-similar by Definition 1 (Task Similarity), both tasks $\texttt{TA}^e$ and $T$ also have the same state space, action ...

Figures (8)

  • Figure 1: Random Walk example (Sutton and Barto, 2018), where (a) describes how the target reward $R^T$ is received and (b) illustrates how the assistant reward $R^A$ is acquired. In both cases, the episode terminates when the agent reaches state A or E.
  • Figure 2: The behaviour of the $\beta(e)$ function used in the Random Walk example for different $\lambda$ values. As $\lambda$ increases, the agent shifts more slowly from learning the auxiliary goal $A$ to learning the main goal $T$.
  • Figure 3: The task similarity between $\texttt{TA}$ and the main task $T$ at episode $e$ (i.e., $\alpha_{\texttt{TA}^e,T}$), computed for the Random Walk example with various numbers of states and $\lambda$ values. Propositions 1, 2, and 3 all hold: $\alpha_{A,T} \leq \alpha_{\texttt{TA}^e,T} \leq 1$ and $\lim_{e\rightarrow\infty} \alpha_{\texttt{TA}^e,T} = 1$.
  • Figure 4: The main goal learning curves in the Random Walk example for different $\lambda$ values and different numbers of states. The performance measure is the RMS error between the true value function and the value function learned under the assumed rewards (only $R^T$, only $R^A$, or TA-Explore with different $\lambda$ values), averaged over the states and then over 100 runs. In all cases, TA-Explore learns faster when $\lambda$ is chosen appropriately. Curves missing from (b) and (c) diverged and were omitted so that the remaining results can be shown in more detail.
  • Figure 5: The task similarity between $\texttt{TA}$ and the main task $T$ at episode $e$ (i.e., $\alpha_{\texttt{TA}^e,T}$), computed for the optimal temperature control problem with constraint, for different weights on the control objective (i.e., (a) $1\Vert a\Vert ^2$, (b) $10\Vert a\Vert ^2$, (c) $100\Vert a\Vert ^2$).
  • ...and 3 more figures
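
The RMS-error measure from the Figure 4 caption can be made concrete with a small tabular TD(0) sketch. It is a hedged illustration, not a reproduction of the paper's experiment. It assumes the classic Sutton and Barto setup: `n` non-terminal states with a terminal state beyond each end, a main reward of +1 for exiting on the right, no discounting, and values initialised to 0.5. Under those assumptions the true value of state `i` is `(i + 1) / (n + 1)`. The assistant reward (+1 for every step to the right), the schedule `lam ** e`, and all function names are hypothetical. The paper's Figure 1 defines its own rewards and terminal states, which may differ.

```python
import random


def true_values(n):
    """Exact main-task values for the assumed symmetric walk (gamma = 1)."""
    return [(i + 1) / (n + 1) for i in range(n)]


def rms_error(v, v_true):
    """Root-mean-square error averaged over the states."""
    return (sum((a - b) ** 2 for a, b in zip(v, v_true)) / len(v)) ** 0.5


def td0_random_walk(n=5, episodes=300, alpha=0.05, lam=None, seed=0):
    """Tabular TD(0) on a random walk; returns the RMS error after each episode.

    lam=None learns from the main reward only. Otherwise the reward is the
    convex combination b * r_aux + (1 - b) * r_main with b = lam ** e
    (hypothetical schedule and hypothetical assistant reward).
    """
    rng = random.Random(seed)
    v = [0.5] * n
    target = true_values(n)
    errors = []
    for e in range(episodes):
        b = 0.0 if lam is None else lam ** e
        s = n // 2  # assumption: every episode starts in the centre state
        while True:
            step = rng.choice((-1, 1))
            s_next = s + step
            done = s_next < 0 or s_next >= n
            r_main = 1.0 if s_next >= n else 0.0  # +1 only for a right exit
            r_aux = 1.0 if step == 1 else 0.0     # hypothetical dense reward
            r = b * r_aux + (1.0 - b) * r_main
            v_next = 0.0 if done else v[s_next]
            v[s] += alpha * (r + v_next - v[s])   # undiscounted TD(0) update
            if done:
                break
            s = s_next
        errors.append(rms_error(v, target))  # error is always vs. the main task
    return errors
```

Calling `td0_random_walk(lam=None)` and `td0_random_walk(lam=0.9)` returns one learning curve each from a single seeded run. Averaging such curves over many seeds would mirror the "averaged over 100 runs" protocol in the caption. Whether the blended run learns faster depends on how well the assistant reward matches the main task, so this sketch makes no claim about the outcome.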

Theorems & Definitions (9)

  • Definition 1: Task Similarity
  • Definition 2: Task Simplicity
  • Definition 3: $\beta(e)$
  • Remark 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Remark 2
  • Remark 3