Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Prakash Panangaden, Christopher G. Lucas, David Abel, Stefano V. Albrecht

TL;DR

The paper addresses how to learn effective actor and critic representations in on-policy reinforcement learning by introducing mutual information-based metrics to quantify specialization in decoupled versus shared architectures. It shows that decoupled representations lead the actor to capture action-relevant information while the critic encodes value and dynamics, supported by both theoretical characterizations and empirical results on PPO, PPG, and DCPG across Procgen and Brax. A key finding is that the critic can influence data collection and exploration, and that representation objectives must be chosen carefully to avoid bias that hinders convergence to the optimal policy. The work provides practical guidance for designing representation learning objectives and suggests that promoting level-invariant information in the actor, as well as exploiting the critic’s influence on exploration, can improve sample efficiency and generalization in complex environments.

Abstract

Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we explore the principles that underlie effective representations for the actor and for the critic in on-policy algorithms. We focus our study on understanding whether the actor and critic will benefit from separate, rather than shared, representations. Our primary finding is that when separated, the representations for the actor and critic systematically specialise in extracting different types of information from the environment -- the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. We conduct a rigorous empirical study to understand how different representation learning approaches affect the actor and critic's specialisations and their downstream performance, in terms of sample efficiency and generalisation capabilities. Finally, we discover that a separated critic plays an important role in exploration and data collection during training. Our code, trained models and data are accessible at https://github.com/francelico/deac-rep.
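
To make the distinction between shared and separate representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes tiny random linear maps as stand-ins for deep encoders; the names `ActorCritic`, `make_linear`, `apply_linear`, `phi_actor` and `phi_critic` are hypothetical and chosen only for illustration.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random linear map stored as a list of weight rows (stand-in for a deep network)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply the weight matrix by the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ActorCritic:
    """Toy actor-critic with either a shared or a decoupled representation."""

    def __init__(self, obs_dim, rep_dim, n_actions, decoupled, seed=0):
        rng = random.Random(seed)
        self.phi_actor = make_linear(obs_dim, rep_dim, rng)
        if decoupled:
            # Decoupled: the critic owns a separate encoder with its own parameters.
            self.phi_critic = make_linear(obs_dim, rep_dim, rng)
        else:
            # Shared: the critic reuses the very same encoder object as the actor.
            self.phi_critic = self.phi_actor
        self.policy_head = make_linear(rep_dim, n_actions, rng)
        self.value_head = make_linear(rep_dim, 1, rng)

    def forward(self, obs):
        z_a = apply_linear(self.phi_actor, obs)    # actor representation
        z_c = apply_linear(self.phi_critic, obs)   # critic representation
        logits = apply_linear(self.policy_head, z_a)
        value = apply_linear(self.value_head, z_c)[0]
        return logits, value, z_a, z_c


shared = ActorCritic(obs_dim=4, rep_dim=3, n_actions=2, decoupled=False)
separate = ActorCritic(obs_dim=4, rep_dim=3, n_actions=2, decoupled=True)
obs = [0.5, -1.0, 0.25, 2.0]
shared_out = shared.forward(obs)
separate_out = separate.forward(obs)
```

In the shared model the two heads read the same features, so any gradient from the value loss would also reshape what the actor sees. In the decoupled model `z_a` and `z_c` come from different parameters, which is the setting in which the abstract reports specialisation.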

Paper Structure

This paper contains 17 sections, 10 theorems, 26 equations, 13 figures, 6 tables.

Key Result

theorem 3.1

The difference between the returns achieved on the training levels and under the full distribution, i.e. the generalisation error, has an upper bound that depends on $\text{I}(Z_A;L)$, where $c\sim \mathcal{U}(L)$ indicates that $c$ is sampled uniformly over the levels in $L$, $D$ is a constant such that $|V^\pi (x)| \leq D/2,\forall x,\pi$, and $Z_A$ is the output space of the actor's learned representation.
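
As an illustration of the quantity $\text{I}(Z_A;L)$ that the bound depends on, here is a minimal sketch of a plug-in mutual information estimate for discrete samples. It assumes the representation has been discretised into a small set of codes, which is a simplification made here; the paper's own estimator may differ, and the function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) written in terms of raw counts.
        ratio = c * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log(ratio)
    return mi


# Toy data: four observations drawn from two levels.
levels = [0, 0, 1, 1]
z_level_specific = ["a", "a", "b", "b"]   # code reveals the level exactly
z_level_invariant = ["a", "a", "a", "a"]  # code carries no level information

mi_specific = mutual_information(z_level_specific, levels)
mi_invariant = mutual_information(z_level_invariant, levels)
```

In this toy case the level-specific code attains $\log 2$ nats, the entropy of two equally likely levels, while the level-invariant code attains zero. This matches the intuition in the TL;DR that promoting level-invariant information in the actor is tied to generalisation.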

Figures (13)

  • Figure 1: Models with shared (left) and decoupled representations (right).
  • Figure 2: (Top) the initial observations and state spaces of three levels from the assembly line environment in the section on specialisation theory. (Bottom) the reduced MDPs spanned by $\phi^*_A$ and $\phi^*_C$.
  • Figure 3: Mean and 95% confidence interval aggregates of $I(Z;\cdot)$/$I(O;\cdot)$ (top/bottom rows) in Procgen. Gray bars indicate $I(Z;\cdot)$/$I(O;\cdot)$ for a shared $\phi$. Blue and orange bars indicate $I(Z;\cdot)$ measured for $\phi_A$ and $\phi_C$ when employing a decoupled architecture. Pink bars indicate $I(O;\cdot)$ measured when using a decoupled architecture. X-axes are shared across top and bottom. For all algorithms, decoupling induces specialisation consistent with the section on specialisation theory.
  • Figure 4: Effect of parameter scaling in coupled (blue) and decoupled (orange) PPO. Scores normalized by model performance at 0.6M parameters.
  • Figure 5: Mean and 95% confidence intervals of $I(Z;\cdot)$/$I(O;\cdot)$ (top/bottom) for actor (blue) and critic (orange) representations in Procgen. Information measured from agent observations shown in pink. X-axes are shared across top and bottom. Auxiliary tasks shown are MICo, dynamics prediction (D), and data augmentation (Dr) applied to the actor (A).
  • ...and 8 more figures

Theorems & Definitions (15)

  • theorem 3.1
  • theorem 3.2
  • definition 4.1
  • lemma 1
  • lemma 2
  • theorem A.1
  • proof
  • theorem A.1
  • theorem A.1
  • proof
  • ...and 5 more