Slow and Steady Wins the Race: Maintaining Plasticity with Hare and Tortoise Networks

Hojoon Lee; Hyeonseo Cho; Hyunseung Kim; Donghu Kim; Dugki Min; Jaegul Choo; Clare Lyle

TL;DR

The paper investigates why neural networks lose generalization ability when trained with warm-starting, and shows that improving trainability alone does not restore generalization in modern architectures. It introduces Hare & Tortoise, a dual-network system in which a fast-learning Hare network is periodically reset to the weights of a slow-moving Tortoise network, which is itself maintained as an exponential moving average of the Hare's weights. This design decouples plasticity from knowledge retention. The method improves generalization in warm-start, continual learning, and RL settings (e.g., Atari-100k), often outperforming reinitialization-based baselines and standard regularizers. It offers a practical way to maintain plasticity without erasing valuable prior knowledge, with implications for data-efficient learning in large-scale models.

Abstract

This study investigates the loss of generalization ability in neural networks, revisiting warm-starting experiments from Ash & Adams. Our empirical analysis reveals that common methods designed to enhance plasticity by maintaining trainability provide limited benefits to generalization. While reinitializing the network can be effective, it also risks losing valuable prior knowledge. To this end, we introduce the Hare & Tortoise, inspired by the brain's complementary learning system. Hare & Tortoise consists of two components: the Hare network, which rapidly adapts to new information analogously to the hippocampus, and the Tortoise network, which gradually integrates knowledge akin to the neocortex. By periodically reinitializing the Hare network to the Tortoise's weights, our method preserves plasticity while retaining general knowledge. Hare & Tortoise can effectively maintain the network's ability to generalize, which improves advanced reinforcement learning algorithms on the Atari-100k benchmark. The code is available at https://github.com/dojeon-ai/hare-tortoise.
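The update cycle described in the abstract can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation (see the linked repository for that): the function names, the learning rate, the EMA momentum of 0.9, the reset interval of 10 steps, and the toy quadratic objective are all assumptions chosen only to make the example runnable.

```python
from typing import Callable, List, Tuple

Params = List[float]


def ema_update(tortoise: Params, hare: Params, momentum: float) -> Params:
    """Move the Tortoise a small step toward the Hare (exponential moving average)."""
    return [momentum * t + (1.0 - momentum) * h for t, h in zip(tortoise, hare)]


def hare_tortoise_train(
    init: Params,
    grad_fn: Callable[[Params], Params],
    steps: int,
    lr: float = 0.1,            # hypothetical value
    momentum: float = 0.9,      # hypothetical value
    reset_interval: int = 10,   # hypothetical value
) -> Tuple[Params, Params]:
    hare = list(init)
    tortoise = list(init)
    for step in range(1, steps + 1):
        # Hare: fast adaptation with an ordinary gradient step.
        grads = grad_fn(hare)
        hare = [h - lr * g for h, g in zip(hare, grads)]
        # Tortoise: slow integration of the Hare's weights.
        tortoise = ema_update(tortoise, hare, momentum)
        # Periodic reset: the Hare restarts from the Tortoise's weights.
        if step % reset_interval == 0:
            hare = list(tortoise)
    return hare, tortoise


def quadratic_grad(params: Params) -> Params:
    """Gradient of the toy objective 0.5 * sum((p - 3)^2), minimized at p = 3."""
    return [p - 3.0 for p in params]


hare, tortoise = hare_tortoise_train([0.0, 0.0], quadratic_grad, steps=200)
print(hare, tortoise)
```

In this sketch the Hare is overwritten by the Tortoise rather than by fresh random weights, which is the point of the design as the abstract describes it: the reset restores a slowly averaged state instead of discarding what has been learned.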

Paper Structure (30 sections, 6 equations, 9 figures, 8 tables, 1 algorithm)


Introduction
Related Work
Loss of Plasticity
Complementary Learning System
Investigating the Effect of Warm-Starting on Neural Network Generalization
Experimental Setup
Effects of Warm-Starting on Generalizability
Effects of Optimizer on Generalizability
Enhancing Trainability is Insufficient for Maintaining Generalizability
Method
Hare & Tortoise Architecture
Training Process
Implementation
Experiments
Warm-Starting
...and 15 more sections

Figures (9)

Figure 1: Hare & Tortoise architecture. The Hare Network rapidly updates its weights for new data, while the Tortoise Network slowly integrates knowledge through an exponential moving average (ema) of the Hare's weights. Periodic reinitialization of the Hare Network to the Tortoise Network's weights ensures a balance between fast, fleeting adaptation and slow, steady generalization.
Figure 2: Impact of Warm-Starting on Generalization. (a) Shows a negative correlation between test accuracy and subset ratio without label noise. (b) Shows a negative correlation between test accuracy and label noise ratio, with the full dataset. (c) Shows the combined impact, indicating that both reduced data size and increased label noise harm generalization.
Figure 3: Effect of Optimizer Parameters. Varying $\beta_1$ and $\beta_2$ yields only marginal improvements. Larger values of $\epsilon$ alleviate the generalization loss but do not eliminate it.
Figure 4: Comparison of Existing Methods. This figure presents a comparative analysis of test accuracies for different methods applied to networks warm-started with a 10% subset ratio and 50% label noise. Dashed lines indicate the performance of a warm-started network (lower bound) and a fresh network without warm-starting (upper bound). Generalizability methods (L2, Aug) are marked in green, Trainability methods (Spectral, Regen, ReDo, CReLU) in red, and Re-initialization methods (Head Reset, Shrink & Perturb) in blue.
Figure 5: Warm-Starting Results. This graph presents the effectiveness of Hare & Tortoise in warm-starting experiments, compared to EMA, Self-Distillation, and Re-initialization methods. Hare & Tortoise shows superior performance in CIFAR-10 (ResNet-18) and CIFAR-100 (ViT-Tiny), while reinitialization shows greater effectiveness in Tiny ImageNet (VGG-16) with severe generalization loss.
...and 4 more figures
