How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories

Reabetswe M. Nkhumise; Debabrota Basu; Tony J. Prescott; Aditya Gilra

TL;DR

Empirical analyses across various environments and algorithms demonstrate that the Effort of Sequential Learning (ESL) and the Optimal Movement Ratio (OMR) provide insights into the exploration processes of RL algorithms and the hardness of different tasks in discrete and continuous MDPs.

Abstract

The rising successes of RL are propelled by combining smart algorithmic strategies and deep architectures to optimize the distribution of returns and visitations over the state-action space. A quantitative framework to compare the learning processes of these eclectic RL algorithms is currently absent but desired in practice. We address this gap by representing the learning process of an RL algorithm as a sequence of policies generated during training, and then studying the policy trajectory induced in the manifold of state-action occupancy measures. Using an optimal transport-based metric, we measure the length of the paths induced by the policy sequence yielded by an RL algorithm between an initial policy and a final optimal policy. On this basis, we first define the 'Effort of Sequential Learning' (ESL). ESL quantifies the relative distance that an RL algorithm travels compared to the shortest path from the initial to the optimal policy. Further, we connect the dynamics of policies in the occupancy measure space and regret (another metric to understand the suboptimality of an RL algorithm) by defining the 'Optimal Movement Ratio' (OMR). OMR assesses the fraction of movements in the occupancy measure space that effectively reduce an analogue of regret. Finally, we derive approximation guarantees to estimate ESL and OMR with a finite number of samples and without access to an optimal policy. Through empirical analyses across various environments and algorithms, we demonstrate that ESL and OMR provide insights into the exploration processes of RL algorithms and the hardness of different tasks in discrete and continuous MDPs.
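
To make the two quantities in the abstract concrete, the following is a minimal sketch and not the paper's implementation. It assumes occupancy measures are given as probability vectors over points 0, 1, ..., n-1 on a line, so that the 1-Wasserstein distance reduces to a sum of absolute CDF differences. The function names `w1_line`, `esl`, and `omr` are hypothetical. The ratio forms used here are one plausible reading of the abstract's wording; the formal definitions are Definitions 2 and 3 of the paper, which this page does not reproduce.

```python
from itertools import accumulate


def w1_line(p, q):
    """1-Wasserstein distance between two distributions on points 0..n-1.

    Assumption: unit spacing on a line, so W1 = sum of |CDF_p - CDF_q|.
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def esl(occupancies):
    """Assumed form: path length travelled / direct distance from first to last."""
    path = sum(w1_line(a, b) for a, b in zip(occupancies, occupancies[1:]))
    direct = w1_line(occupancies[0], occupancies[-1])
    if direct == 0:
        raise ValueError("initial and final occupancy measures coincide")
    return path / direct


def omr(occupancies):
    """Assumed form: share of the total change in distance-to-final
    that is a reduction (the final measure stands in for the optimal one)."""
    target = occupancies[-1]
    to_opt = [w1_line(v, target) for v in occupancies]
    deltas = [a - b for a, b in zip(to_opt, to_opt[1:])]  # > 0 means closer
    total = sum(abs(d) for d in deltas)
    if total == 0:
        raise ValueError("distance-to-final never changes")
    return sum(d for d in deltas if d > 0) / total


# Toy trajectory of four occupancy measures over three states (made-up data):
# the agent moves toward the target, steps back, then jumps to the target.
trajectory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
print("ESL:", esl(trajectory))
print("OMR:", omr(trajectory))
```

Under these assumptions, a trajectory that moves straight to the final measure has both ratios equal to 1; detours raise the first ratio, and steps away from the final measure lower the second.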

Paper Structure (40 sections, 5 theorems, 73 equations, 16 figures, 13 tables)

Introduction
Preliminaries
RL Algorithms as Trajectories of Occupancy Measures
Effort of Sequential Learning (ESL)
Optimal Movement Ratio (OMR)
Extension to Finite-Horizon Episodic Setting
Computational Challenges and Solutions
Policy datasets for computing occupancy measures
When an optimal policy is not reached
Experimental Evaluation
Exploration Trajectories of RL Algorithms
Comparison of ESL and OMR across RL Algorithms and Environments
ESL Increases with Task Difficulty
Related Works
Discussion and Future Works
...and 25 more sections

Key Result

Proposition 1

If the policy $\pi$ has a smooth parameterization $\theta$ and the inverse of $P^{\pi}(s,s') \triangleq \sum_{a}T(s \mid s',a)\pi(a \mid s')$ exists, then the space of occupancy measures $\mathcal{M}$ is a differentiable manifold. (Proof in the appendix.)

Figures (16)

Figure 1: Schematic of the policy trajectory $C$ in the space of occupancy measures $\mathcal{M}$ during RL training (solid line) vs. the geodesic $L$ (shortest path, dashed line) between the initial and final points (i.e. $\pi_{0}$ and $\pi_{N} = \pi^{*}$).
Figure 2: Schematic of how distance-to-optimal (denoted by $x_{k}$) and stepwise-distance (denoted by $y_{k}$) on the occupancy measure space describe the exploratory process of an RL algorithm during training.
Figure 3: The top row shows 3D scatter plots of distance-to-optimal (x-axis) and stepwise-distance (y-axis) across the number of updates (z-axis), illustrating policy evolution in the occupancy measure space for the RL algorithms $\epsilon$(=0)-greedy and $\epsilon$(=1)-greedy Q-learning, UCRL2, PSRL, SAC, and DQN (left to right). The bottom row depicts the corresponding state visitation frequencies over the full training. The problem setting is deterministic with dense rewards and a maximum of 15 steps per episode. (Larger versions of these plots are in the appendix "More Results".)
Figure 4: Top row: 3D scatter plots of distance-to-optimal and stepwise-distance over number of updates for algorithms DDPG and SAC. Bottom row: OMR($k$) versus #update, $k$, for the corresponding algorithms.
Figure 5: Q-learning with $\epsilon$-greedy ($\epsilon$ = 0.9 decaying, averaged over 40 runs) across deterministic 2D-Gridworld (5x5 and 15x15) tasks. The 1st and 4th tasks (from left to right) have dense rewards, while the rest have sparse rewards (details in the appendix "Environment Description").
...and 11 more figures

Theorems & Definitions (8)

Proposition 1: Properties of $\mathcal{M}$
Definition 1: Effort of Learning
Definition 2: Effort of Sequential Learning (ESL)
Proposition 2: Regret and Occupancy Measures
Definition 3: Optimal Movement Ratio (OMR)
Proposition 3: Properties of $\mathcal{M}^H$
Proposition 4: Upper Bound on Estimation Error
Proposition 5