What is the role of memorization in Continual Learning?

Jędrzej Kozal, Jan Wasilewski, Alif Ashrafee, Bartosz Krawczyk, Michał Woźniak

TL;DR

This work investigates the role of memorization in continual learning, distinguishing memorization from forgetting and introducing a computable memorization score and a cheaper training-time proxy. It shows that increasing the number of classes elevates memorization and that high-memorization samples are more prone to forgetting under distribution shifts, while memorization is still necessary for high performance. The authors propose Memorization-aware Experience Replay to leverage memorization during incremental training and demonstrate, across standard CL benchmarks, that memorization-aware buffer strategies yield improvements, especially when the replay buffer is large. The study highlights implications for CL benchmark design and outlines future directions to localize memory-encoding components in networks and to develop robust incremental memorization measures.

Abstract

Memorization impacts the performance of deep learning algorithms. Prior work has studied memorization primarily in the context of generalization and privacy. This work studies the effect of memorization in incremental learning scenarios. Forgetting prevention and memorization seem similar, but their differences deserve discussion. We designed extensive experiments to evaluate the impact of memorization on continual learning. We showed that training examples with high memorization scores are forgotten faster than regular samples. Our findings also indicate that memorization is necessary to achieve the highest performance. However, in low-memory regimes, forgetting of regular samples is more important. We showed that the importance of samples with high memorization scores rises as the buffer size increases. We introduced a memorization proxy and employed it in the buffer policy problem to showcase how memorization could be used during incremental training. We demonstrated that including samples with a higher proxy memorization score is beneficial when the buffer size is large.

Paper Structure

This paper contains 30 sections, 3 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: The impact of data and architecture on memorization scores. (Left) histogram of memorization scores for different numbers of classes in the training dataset. (Middle) the dependence between memorization score and the number of samples in the dataset. (Right) histogram of memorization scores for different architectures.
  • Figure 2: Memorization scores for different model widths
  • Figure 3: Task accuracy for test set (solid line) and long tail (dotted line) across incremental training on Seq-Cifar100 stream with 10 tasks. (Left) training with buffer size 500. (Middle) training with full access to previous tasks. (Right) training with LwF. Results averaged over 5 runs.
  • Figure 4: Linear probe accuracy for task 5 during incremental training with SGD on Seq-Cifar100. Results averaged over 5 runs.
  • Figure 5: Correlation of training iteration with memorization score.
  • ...and 8 more figures