Representation-Driven Reinforcement Learning

Ofir Nabati; Guy Tennenholtz; Shie Mannor

TL;DR

RepRL is a representation-driven reinforcement learning framework that maps policies to a low-dimensional latent space in which the value is linear, $v(\pi) = \langle f(\pi), w\rangle$, so that contextual-bandit algorithms can guide exploration. By learning the representation via variational inference and constructing decision sets in policy or latent space, RepRL reframes exploration-exploitation as a representation-exploitation problem. The framework is instantiated as RepRL-ES and RepRL-PG and validated on MuJoCo and MinAtar, with notable gains in sparse-reward settings, which highlights the role of policy representation in efficient exploration. This work shifts the focus of RL from improving optimization in policy space alone to treating representation quality as a lever for exploration efficiency, and suggests future integration with large-scale pretraining and broader bandit formulations.

Abstract

We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. In particular, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework by applying it to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.
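The idea of treating policies as arms of a linear bandit can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a fixed random projection `P` as a stand-in for the learned representation $f(\pi)$ (the paper learns it by variational inference), a synthetic value that is exactly linear in the embedding, a decision set built from Gaussian perturbations of the current policy parameters, and a standard UCB-style selection rule. All names (`run_linear_bandit_over_policies`, `P`, `w_true`, `beta`) are hypothetical.

```python
import numpy as np


def run_linear_bandit_over_policies(n_rounds=200, n_candidates=32, param_dim=20,
                                    latent_dim=4, noise_std=0.1, beta=1.0, seed=0):
    """Toy loop: embed candidate policies, pick one by an optimistic linear estimate."""
    rng = np.random.default_rng(seed)

    # ASSUMPTION: fixed random projection standing in for a learned f(pi).
    P = rng.normal(size=(latent_dim, param_dim)) / np.sqrt(param_dim)
    # ASSUMPTION: synthetic ground truth, v(pi) = <f(pi), w_true>, hidden from the learner.
    w_true = rng.normal(size=latent_dim)

    A = np.eye(latent_dim)       # regularized design matrix: I + sum of x x^T
    b = np.zeros(latent_dim)     # sum of reward-weighted features
    theta = np.zeros(param_dim)  # parameters of the current policy
    values = []

    for _ in range(n_rounds):
        # Decision set: random perturbations around the current policy parameters.
        candidates = theta + 0.5 * rng.normal(size=(n_candidates, param_dim))
        feats = candidates @ P.T                 # f(pi) for every candidate

        A_inv = np.linalg.inv(A)
        w_hat = A_inv @ b                        # ridge estimate of w
        # Exploration bonus sqrt(x^T A^{-1} x), computed per candidate.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", feats, A_inv, feats))
        chosen = int(np.argmax(feats @ w_hat + beta * bonus))

        # "Collect data": one noisy evaluation of the chosen policy.
        x = feats[chosen]
        reward = x @ w_true + noise_std * rng.normal()

        A += np.outer(x, x)
        b += reward * x
        theta = candidates[chosen]
        values.append(float(x @ w_true))

    return np.array(values), np.linalg.solve(A, b), w_true


if __name__ == "__main__":
    values, w_hat, w_true = run_linear_bandit_over_policies()
    print("mean value, first 20 rounds:", values[:20].mean())
    print("mean value, last 20 rounds: ", values[-20:].mean())
```

In this toy setting the quality of the embedding determines how well the bandit can rank candidates, which is the sense in which exploration becomes a representation problem. A learned, probabilistic representation would replace `P`.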

Paper Structure (31 sections, 1 theorem, 16 equations, 8 figures, 6 algorithms)

Introduction
Preliminaries
Linear Bandits
RL as a Linear Bandit Problem
RepRL
Learning Representations for RepRL
Constructing a Decision Set
Policy Space Decision Set
Latent Space Decision Set
History-based Decision Set
Inner trajectory sampling
RepRL Algorithms
Representation Driven Evolution Strategy
Representation Driven Policy Gradient
Experiments
...and 16 more sections

Key Result

Proposition 3.2

For a policy $\pi \in \Pi$ with stationary $\rho^\pi$, we have $\tilde{v}(\pi) = \frac{v(\pi)}{1 - \gamma}$.

Figures (8)

Figure 1: RepRL scheme, composed of four stages: representing the policy parameters, constructing a decision set, choosing the best arm using an off-the-shelf linear bandit algorithm, and collecting data with the chosen policy.
Figure 2: The structure of the networks in RepRL. The policy's parameters are fed into the representation network, which acts as a posterior distribution over the policy's latent representation. A latent representation sampled from this posterior is used by the bandit algorithm to evaluate a value that encapsulates the exploration-exploitation tradeoff.
Figure 3: The two-dimensional t-SNE visualization depicts the policy representation in the GridWorld experiment. On the right, we observe the learned latent representation, while on the left, we see the direct representation of the policy's weights. Each point in the visualization corresponds to a distinct policy, and the color of each point corresponds to a sample of the policy's value.
Figure 4: GridWorld visualization experiment. Trajectories were averaged across 100 seeds at various times during training, where more recent trajectories have greater opacity. Background colors indicate the level of mean reward.
Figure 5: MuJoCo experiments during training. The results are for the MuJoCo suite (top) and the modified sparse MuJoCo (bottom).
...and 3 more figures

Theorems & Definitions (2)

Proposition 3.2
proof
