Self-supervised network distillation: an effective approach to exploration in sparse reward environments

Matej Pecháč, Michal Chovanec, Igor Farkaš

TL;DR

This work tackles exploration under sparse rewards in reinforcement learning by introducing Self-supervised Network Distillation (SND), a framework in which a learnable target model is trained through self-supervision and the distillation error between predictor and target guides exploration. SND generalizes Random Network Distillation (RND) by letting the target representation be updated with a self-supervised loss, in three variants: SND-V (contrastive), SND-STD (ST-DIM-based), and SND-VIC (VICReg-based). Across ten challenging environments (Atari and ProcGen), the SND variants learn faster and reach higher external rewards than the baselines, and their intrinsic rewards remain informative rather than vanishing. The study also provides analytical evidence that richer, decorrelated, and more discriminative feature spaces underlie the observed improvements, suggesting SND as a versatile intrinsic-motivation tool for sparse-reward RL with practical implications for open-ended and autonomous learning.

Abstract

Reinforcement learning can solve decision-making problems and train an agent to behave in an environment according to a predesigned reward function. However, this approach becomes very problematic when the reward is so sparse that the agent does not come across it while exploring the environment. One solution is to equip the agent with an intrinsic motivation that provides informed exploration, during which the agent is also likely to encounter external reward. Novelty detection is one of the promising branches of intrinsic motivation research. We present Self-supervised Network Distillation (SND), a class of intrinsic motivation algorithms that use the distillation error as a novelty indicator and in which both the predictor model and the target model are trained. We adapted three existing self-supervised methods for this purpose and tested them experimentally on a set of ten environments that are considered difficult to explore. The results show that, for the same training time, our approach achieves faster growth of the external reward and a higher final external reward than the baseline models, which implies improved exploration in very sparse reward environments. In addition, the analytical methods we applied provide valuable explanatory insights into our proposed models.

Paper Structure

This paper contains 19 sections, 17 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Self-supervised network distillation (SND) principle. The proposed method consists of two main parts. Top: Self-supervised learning of suitable features for the target model. Bottom: Calculation of the intrinsic reward by target model distillation, using the squared Euclidean distance between the models' outputs.
  • Figure 2: The basic principle of generating an exploration signal in Random Network Distillation.
  • Figure 3: The basic principle of generating an exploration signal in Self-supervised Network Distillation. The calculation of the intrinsic reward is the same as for RND, but in these methods, the target model changes over time and generates a more complex feature space.
  • Figure 4: Training of the SND target model using two consecutive states and the self-supervised learning algorithm.
  • Figure 5: The scheme of the state augmentation pipeline.
  • ...and 11 more figures
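
The captions of Figures 1 and 3 describe the intrinsic reward as the squared Euclidean distance between the outputs of the target and predictor models. The following is a minimal sketch of that computation in plain Python, not the paper's implementation. The one-layer tanh encoders, the function names (`make_encoder`, `encode`, `intrinsic_reward`, `distill_step`, `mean_reward`, `demo`), the dimensions, and the learning rate are all illustrative assumptions. The target is kept frozen here, which corresponds to the RND case; in SND the target would additionally be updated by a self-supervised loss, which this sketch omits.

```python
import math
import random


def make_encoder(rng, in_dim, out_dim):
    """Random weights for a one-layer tanh encoder (illustrative stand-in only)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def encode(weights, state):
    """Map a state vector to a feature vector: tanh(W s)."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]


def intrinsic_reward(target, predictor, state):
    """Squared Euclidean distance between target and predictor features."""
    t = encode(target, state)
    p = encode(predictor, state)
    return sum((ti - pi) ** 2 for ti, pi in zip(t, p))


def distill_step(target, predictor, state, lr=0.05):
    """One SGD step moving the predictor toward the target on a single state.

    The target is left unchanged (assumption: frozen target, as in RND).
    """
    t = encode(target, state)
    p = encode(predictor, state)
    for row, ti, pi in zip(predictor, t, p):
        # Gradient of (p - t)^2 w.r.t. the row weights, with p = tanh(row . state).
        grad_scale = 2.0 * (pi - ti) * (1.0 - pi * pi)
        for j, sj in enumerate(state):
            row[j] -= lr * grad_scale * sj


def mean_reward(target, predictor, states):
    """Average intrinsic reward over a list of states."""
    return sum(intrinsic_reward(target, predictor, s) for s in states) / len(states)


def demo(seed=0, in_dim=4, out_dim=3, n_states=32, epochs=50):
    """Return the mean intrinsic reward on visited states before and after distillation."""
    rng = random.Random(seed)
    target = make_encoder(rng, in_dim, out_dim)
    predictor = make_encoder(rng, in_dim, out_dim)
    visited = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_states)]
    before = mean_reward(target, predictor, visited)
    for _ in range(epochs):
        for state in visited:
            distill_step(target, predictor, state)
    after = mean_reward(target, predictor, visited)
    return before, after
```

Calling `demo()` returns the mean intrinsic reward on a set of repeatedly visited states before and after distillation. The second value should be smaller than the first, which is the property that lets the distillation error act as a novelty signal: states the predictor has been trained on yield little reward, while unfamiliar ones yield more.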