Light-weight probing of unsupervised representations for Reinforcement Learning

Wancong Zhang, Anthony GX-Chen, Vlad Sobal, Yann LeCun, Nicolas Carion

TL;DR

The paper tackles the challenge of evaluating unsupervised visual representations for reinforcement learning without expensive RL trials. It introduces reward probing and expert-action probing as lightweight linear evaluation tasks on frozen representations, and demonstrates a strong rank correlation between reward probing and downstream RL performance on Atari 100k. By systematically varying transition models, SSL objectives, and encoder sizes, it shows that forward-model expressiveness and latent dynamics significantly influence representation quality, with Barlow Twins variants and latent GRUs yielding strong results. The findings underscore the practical value of proxy probing for guiding pretraining design: probing cannot fully replace direct RL evaluation, but it can dramatically accelerate SSL development for RL. The framework thus enables more efficient exploration of pretraining recipes and highlights key design choices that boost RL performance through better unsupervised representations.

Abstract

Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms, which is computationally intensive and yields high-variance outcomes. Inspired by the vision community, we study whether linear probing can be a proxy evaluation task for the quality of unsupervised RL representations. Specifically, we probe for the observed reward in a given state and the action of an expert in a given state, both of which are generally applicable to many RL domains. Through rigorous experimentation, we show that the probing tasks are strongly rank correlated with the downstream RL performance on the Atari100k benchmark, while having lower variance and up to 600x lower computational cost. This provides a more efficient method for exploring the space of pretraining algorithms and identifying promising pretraining recipes without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.
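
As a rough illustration of what a linear reward probe on frozen representations involves, the sketch below fits a logistic-regression probe with plain gradient descent and scores it with F1. It is not the authors' implementation: the function names `train_linear_probe` and `f1_score`, the hyperparameters, and the synthetic features standing in for frozen encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def train_linear_probe(features, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe on frozen features (hypothetical helper)."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def f1_score(pred, target):
    """F1 for binary predictions; returns 0.0 when there are no positives at all."""
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# Synthetic stand-in for frozen encoder outputs and reward-presence labels.
# In a real probe, `features` would come from the pretrained, frozen encoder.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 16))
true_w = rng.normal(size=16)
labels = (features @ true_w > 1.0).astype(float)

# Only the probe's weights are trained; the features are never updated.
w, b = train_linear_probe(features[:1500], labels[:1500])
pred = (features[1500:] @ w + b > 0).astype(float)
probe_f1 = f1_score(pred, labels[1500:])
print(f"held-out reward-probe F1: {probe_f1:.3f}")
```

The point of the design is that only `w` and `b` are trained, so the probe's score reflects how linearly accessible the reward signal is in the representation rather than the capacity of the probe itself.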

Paper Structure (38 sections, 8 equations, 8 figures, 14 tables, 1 algorithm)

Figures (8)

  • Figure 1: Left: Correlation between the SSL representations' abilities to linearly predict the presence of reward in a given state, versus RL performance using the same representations, measured as the interquartile mean of the human-normalized score (HNS) over 9 Atari games. Each point denotes a separate SSL pretraining method. A line of best fit is shown with a 95% confidence interval. We compute Spearman's rank correlation coefficient (Spearman's r) and determine its statistical significance using permutation testing (with $n=50000$). Right: When comparing two models, the reward probing score can give low-variance, reliable estimates of RL performance, while direct RL evaluation may require many seeds to reach meaningful differences in mean performance.
  • Figure 2: Model diagram. The observations consist of a stack of 4 frames, to which we apply data augmentation before passing them to a convolutional encoder. The predictor is a recurrent model outputting future state embeddings given the action. We supervise with an inverse modeling loss (cross-entropy loss on the predicted transition action) and an SSL loss (distance between embeddings).
  • Figure 3: Decoding results, using a de-convolutional model to predict the pixel values from frozen state representations. Both games exhibit stochastic behaviours. In Demon Attack, both models fail to capture the position of the enemies. In Gopher, the enemy (circled in red) is moving randomly, but thanks to the latent variable, the GRU-latent model is able to predict a possible position, while the deterministic model regresses to the mean.
  • Figure 5: Correlation between the SSL representations' abilities to linearly predict (Left) presence of immediate reward and (Right) action, versus RL performance using the same representations, measured as the interquartile mean of the human-normalized score (HNS) over 9 Atari games. Each point denotes a separate SSL pretraining method. A line of best fit is shown with a 95% confidence interval. We compute Spearman's rank correlation coefficient (Spearman's r) and determine its statistical significance using permutation testing (with $n=50000$). Compared to Fig. 1, we added one extra model that obtained poor probing results to demonstrate that the correlation holds across a wide range of performance levels.
  • Figure 6: Reproduction of Fig. 5, left, on a different probing dataset (expert trajectories instead of random ones). The exact values of the F1 scores are different, but the Spearman's r is the same, showing that the correlation is insensitive to the probing dataset.
  • ...and 3 more figures
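
The figure captions above describe computing Spearman's r between probing scores and RL performance and testing its significance by permutation. A minimal sketch of that procedure follows, assuming a two-sided test and average ranks for ties. The helper names and the eight data points are hypothetical placeholders, not values from the paper, and the demo uses fewer permutations than the $n=50000$ stated in the captions to keep it fast.

```python
import numpy as np

def rankdata(x):
    """1-based ranks; tied values receive the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman_permutation_test(x, y, n_perm=50000, seed=0):
    """Return Spearman's r and a two-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    rx, ry = rankdata(x), rankdata(y)
    observed = np.corrcoef(rx, ry)[0, 1]  # Pearson correlation of the ranks
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(ry)  # break the pairing between x and y
        if abs(np.corrcoef(rx, shuffled)[0, 1]) >= abs(observed) - 1e-12:
            hits += 1
    # The +1 terms count the observed pairing itself, so p is never exactly 0.
    return float(observed), (hits + 1) / (n_perm + 1)

# Hypothetical scores for eight pretraining methods (illustrative only).
probe_f1 = [0.31, 0.42, 0.45, 0.52, 0.58, 0.61, 0.66, 0.70]
rl_iqm = [0.10, 0.18, 0.15, 0.25, 0.30, 0.28, 0.36, 0.41]

r, p = spearman_permutation_test(probe_f1, rl_iqm, n_perm=5000)
print(f"Spearman's r = {r:.3f}, permutation p = {p:.4f}")
```

Because only the ranks enter the statistic, the test measures whether probing orders the methods the same way RL evaluation does, which is the property that matters when using the probe to select pretraining recipes.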