Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

Maciej Wołczyk; Bartłomiej Cupiał; Mateusz Ostaszewski; Michał Bortkiewicz; Michał Zając; Razvan Pascanu; Łukasz Kuciński; Piotr Miłoś

TL;DR

This work identifies forgetting of pre-trained capabilities as a core bottleneck in RL fine-tuning and formalizes two forgetting patterns: the state coverage gap and the imperfect cloning gap. It demonstrates that knowledge retention methods, namely Elastic Weight Consolidation (EWC), behavioral cloning (BC), kickstarting (KS), and episodic memory (EM), can substantially mitigate forgetting across NetHack, Montezuma's Revenge, and RoboticSequence. This enables stronger transfer and, in NetHack, a new state of the art for neural models. The effectiveness of each retention strategy depends on the environment: KS performs best in NetHack, while BC and EM are often preferable in the other tasks. Overall, the findings advocate making forgetting mitigation a standard component of transfer RL pipelines so that pre-trained capabilities are used more effectively.

Abstract

Fine-tuning is a widespread technique that allows practitioners to transfer pre-trained capabilities, as recently showcased by the successful applications of foundation models. However, fine-tuning reinforcement learning (RL) models remains a challenge. This work conceptualizes one specific cause of poor transfer, accentuated in the RL setting by the interplay between actions and observations: forgetting of pre-trained capabilities. Namely, a model deteriorates on the state subspace of the downstream task not visited in the initial phase of fine-tuning, on which the model behaved well due to pre-training. This way, we lose the anticipated transfer benefits. We identify conditions when this problem occurs, showing that it is common and, in many cases, catastrophic. Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. In particular, in NetHack, we achieve a new state-of-the-art for neural models, improving the previous best score from $5$K to over $10$K points in the Human Monk scenario.
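To make the idea of a knowledge retention technique concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a discrete action space and shows two common forms of auxiliary penalty added to the RL objective: a distillation term that keeps the fine-tuned policy close to the pre-trained one (the general shape behind BC- and KS-style methods), and an EWC-style quadratic penalty on parameter drift. All function names, the coefficient `coef`, and the choice of which states the penalty is evaluated on are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(pi_teacher || pi_student) for a single state."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_penalty(teacher_batch, student_batch):
    """Mean KL between pre-trained and fine-tuned policies over a batch of states.

    Assumption: the caller decides where the states come from, for example a
    buffer collected with the pre-trained policy or rollouts of the current one.
    """
    kls = [kl_divergence(t, s) for t, s in zip(teacher_batch, student_batch)]
    return sum(kls) / len(kls)


def ewc_penalty(params, pretrained_params, fisher_diag):
    """EWC-style penalty: Fisher-weighted squared distance to pre-trained weights."""
    return sum(
        f * (p - p0) ** 2
        for f, p, p0 in zip(fisher_diag, params, pretrained_params)
    )


def fine_tuning_loss(rl_loss, retention_penalty, coef=1.0):
    """RL objective plus a weighted retention term (coef is a hypothetical knob)."""
    return rl_loss + coef * retention_penalty
```

In this sketch the penalty vanishes when the fine-tuned policy (or its parameters) matches the pre-trained one and grows as the two diverge, which is what discourages the policy from deteriorating on states it does not visit early in fine-tuning.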

Paper Structure (53 sections, 8 equations, 30 figures, 6 tables, 1 algorithm)

This paper contains 53 sections, 8 equations, 30 figures, 6 tables, 1 algorithm.

Introduction
Forgetting of pre-trained capabilities
Experimental setup
Main result: knowledge retention mitigates forgetting of pre-trained capabilities
Analysis: forgetting of pre-trained capabilities hinders RL fine-tuning
Related Work
Transfer in RL
Offline to Online Reinforcement Learning
Impact of interdependence between Far and Close
Generalization to multi-task setting
Continual reinforcement learning
Limitations & Conclusions
Toy Examples -- MDP and AppleRetrieval
Two-state MDPs
State coverage gap
...and 38 more sections

Figures (30)

Figure 1: Forgetting of pre-trained capabilities. For illustration, we partition the states of the downstream task into Close and Far, depending on the distance from the starting state; the agent must master Far to reach the goal. In the state coverage gap (top), the pre-trained policy performs perfectly on Far but is suboptimal on Close. During the initial stage of fine-tuning, while mastering Close, the policy deteriorates, often catastrophically, on Far. In the imperfect cloning gap (bottom), the pre-trained policy is decent on both Close and Far; however, due to compounding errors in the initial stages of fine-tuning, the agent rarely visits Far, and the policy deteriorates on this part. In both cases, the deteriorated policy on Far is hard to recover, so solving the whole task requires long training.
Figure 2: Example of state coverage gap. (Left) We assume that a pre-trained model is able to pick and place objects (e.g., the cylinder). However, it does not know how to open drawers. Consider a new task in which the agent needs first to open the drawer (Close states) and then pick and place the object (Far states). (Right) During fine-tuning, the model rapidly forgets how to manipulate objects before learning to open the drawer and struggles to reacquire this skill (dashed blue line). Knowledge retention techniques alleviate this issue (dashed orange line). At the same time, in both cases, the model learns how to open the drawer (solid lines).
Figure 3: Performance on (a) NetHack, (b) Montezuma's Revenge, and (c) RoboticSequence. For NetHack, the forgetting of pre-trained capabilities (FPC) is driven by the imperfect cloning gap, while for the remaining two it is driven by the state coverage gap. In all cases, knowledge retention techniques improve the performance of fine-tuning. We omit KS in Montezuma's Revenge and RoboticSequence as it underperforms.
Figure 4: Density plots showing the maximum dungeon level achieved compared to the total number of turns (units of in-game time) for the expert AutoAscend (left), the pre-trained policy $\pi_*$ (center), and fine-tuning + KS (right). Brighter colors indicate higher visitation density. The level visitation of $\pi_*$ differs significantly from that of the AutoAscend expert. This is an example of the imperfect cloning gap, as the agent will not see further levels at the start of fine-tuning. The knowledge retention-based method manages to perform well and explore different parts of the state space.
Figure 5: The average return throughout the fine-tuning process on two NetHack tasks: level 4 (top) and the Sokoban level (bottom). The result is averaged over 200 episodes, each starting from where the expert (AutoAscend) ended up upon first entering the level.
...and 25 more figures
