A Method for Fast Autonomy Transfer in Reinforcement Learning

Dinuka Sahabandu; Bhaskar Ramasubramanian; Michail Alexiou; J. Sukarno Mertoguno; Linda Bushnell; Radha Poovendran

TL;DR

The paper addresses rapid autonomy transfer in reinforcement learning by reusing pre-trained critic value functions from multiple environments. It introduces the Multi-Critic Actor-Critic (MCAC) algorithm, which forms a weighted ensemble of $N$ pre-trained critics to approximate the current environment's value function via $\hat{V}(s)=\sum_{i=1}^N w_i\bar{V}_i(s)$ with weights on the probability simplex. Weights are updated on a faster time-scale using a TD-error-driven rule, while the actor's policy updates occur on a slower time-scale, enabling stable convergence. Empirical results on two grid-world case studies show MCAC achieves up to $22.76\times$ faster autonomy transfer and higher rewards than a baseline actor-critic, highlighting the practical impact of cross-environment knowledge transfer in RL.
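The ensemble estimate and a simplex-constrained weight update can be sketched roughly as follows. This is a minimal illustration, not the paper's exact rule: the semi-gradient form of the TD step, the Euclidean projection used to keep the weights on the simplex, and all function and parameter names (`project_to_simplex`, `td_weight_step`, `step`) are assumptions made here for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    n = v.size
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u)
    k = np.arange(1, n + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def ensemble_value(w, critic_values):
    """V_hat(s) = sum_i w_i * V_bar_i(s) for one state s."""
    return float(np.dot(w, critic_values))

def td_weight_step(w, v_s, v_next, reward, gamma, step):
    """One assumed TD-error-driven weight step, then projection.

    v_s and v_next hold the N pre-trained critic values at the current
    and next state. The update direction (td_error * v_s) is a standard
    semi-gradient choice, assumed here rather than taken from the paper.
    """
    td_error = reward + gamma * np.dot(w, v_next) - np.dot(w, v_s)
    return project_to_simplex(w + step * td_error * v_s)

# Hypothetical example with N = 2 critics.
w = np.array([0.5, 0.5])
v_s = np.array([1.0, 3.0])
v_next = np.array([0.0, 0.0])
w_new = td_weight_step(w, v_s, v_next, reward=1.0, gamma=0.9, step=0.1)
```

In a two-time-scale scheme, the step size for the weights would decay more slowly than the actor's step size, so the weights track the current policy while the policy changes gradually.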

Abstract

This paper introduces a novel reinforcement learning (RL) strategy designed to facilitate rapid autonomy transfer by utilizing pre-trained critic value functions from multiple environments. Unlike traditional methods that require extensive retraining or fine-tuning, our approach integrates existing knowledge, enabling an RL agent to adapt swiftly to new settings without requiring extensive computational resources. Our contributions include development of the Multi-Critic Actor-Critic (MCAC) algorithm, establishing its convergence, and empirical evidence demonstrating its efficacy. Our experimental results show that MCAC significantly outperforms the baseline actor-critic algorithm, achieving up to 22.76x faster autonomy transfer and higher reward accumulation. This advancement underscores the potential of leveraging accumulated knowledge for efficient adaptation in RL applications.

Paper Structure (12 sections, 3 theorems, 10 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
MDPs and RL
Stochastic Approximation (SA) Algorithms
The Multi-Critic Actor Critic
Experiments
Case Study Environments
Experiment Setup
Metrics
Results
Conclusion

Key Result

Proposition 1

Consider an SA algorithm of the following form, defined over a set of parameters $x \in \mathcal{R}^{m_x}$ and a continuous function $h: \mathcal{R}^{m_x} \rightarrow \mathcal{R}^{m_x}$, where $\Theta$ is a projection operator that projects each iterate $x^{t}$ onto a compact and convex set $\Lambda \subset \mathcal{R}^{m_x}$ and $\kappa^{t}$ is a bounded random sequence. Let the ODE associated with E
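As a rough illustration of the kind of projected iteration this proposition concerns, the sketch below assumes the common form $x^{t+1} = \Theta\big(x^{t} + a^{t}[h(x^{t}) + \kappa^{t}]\big)$. The step size $a^{t} = 1/(t+1)$, the box-shaped set $\Lambda$, the particular $h$, the uniform noise, and the name `projected_sa` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def projected_sa(h, x0, lower, upper, n_steps, noise_scale, seed=0):
    """Run an assumed projected stochastic-approximation iteration.

    Theta is taken to be clipping onto the box [lower, upper]^m, which is
    the Euclidean projection onto that compact convex set.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        step = 1.0 / (t + 1)                          # assumed step size a^t
        # Bounded zero-mean noise standing in for kappa^t.
        kappa = rng.uniform(-noise_scale, noise_scale, size=x.shape)
        x = np.clip(x + step * (h(x) + kappa), lower, upper)
    return x

# Hypothetical drift h(x) = target - x, whose ODE dx/dt = h(x) has the
# single equilibrium x = target inside the box.
target = np.array([0.3, -0.2])
x_final = projected_sa(lambda x: target - x, [1.0, 1.0],
                       lower=-1.0, upper=1.0, n_steps=5000, noise_scale=0.5)
```

With this choice of drift, the iterates settle near `target`, the equilibrium of the associated ODE, which is the behaviour such convergence results are typically used to establish.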

Figures (3)

Figure 1: Case Study 1 Setup: This figure shows a $5 \times 5$ grid. The blue (green) color state is the start (goal) state and the grey shaded states represent obstacles. The top row shows the configurations of obstacles that are used to obtain the pretrained critics. The bottom row shows the obstacle configurations on which our MCAC algorithm is evaluated. The four deployment scenarios in the bottom row are obtained by combining the pretrained scenarios in the top row in different ways.
Figure 2: Case Study 2 Setup: This figure shows a $16 \times 16$ grid. The blue (green) color state is the start (goal) state and the grey shaded states represent obstacles. The top row shows the configurations of obstacles that are used to obtain the pretrained critics. The bottom row shows the obstacle configurations on which our MCAC algorithm is evaluated. The four deployment scenarios in the bottom row are obtained by combining the pretrained scenarios in the top row in different ways.
Figure 3: This figure compares the baseline actor-critic (AC) algorithm and our multi-critic actor critic (MCAC) algorithm in terms of average total reward (left column) and average number of steps to reach the goal (right column) in Deployment Scenario 2 for Case Study 1. Shaded regions indicate the variance. MCAC consistently achieves higher average reward and does so in significantly fewer episodes. Using MCAC also results in smaller variance compared to the baseline AC algorithm. Our MCAC algorithm also achieves up to 10.44x speedup.

Theorems & Definitions (5)

Definition 1
Proposition 1 (kushner2012stochastic, metivier1984applications)
Remark 1
Theorem 2
Theorem 3
