Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models

Xinyu Liu, Runsong Zhao, Pengcheng Huang, Chunyang Xiao, Bei Li, Jingang Wang, Tong Xiao, Jingbo Zhu

TL;DR

This work proposes a new method, the forgetting curve, to measure the memorization capability of long-context models. It shows that the forgetting curve is robust to the tested corpus and the experimental settings, does not rely on prompts, and can be applied to models of any size.

Abstract

Numerous recent works aim to extend the effective context length of language models, and various methods, tasks, and benchmarks exist to measure a model's effective memorization length. However, through thorough investigation, we find limitations in existing evaluations of a model's memorization capability. We provide an extensive survey of these limitations and propose a new method, the forgetting curve, to measure the memorization capability of long-context models. We show that the forgetting curve is robust to the tested corpus and the experimental settings, does not rely on prompts, and can be applied to models of any size. We apply the forgetting curve to a wide variety of models, including both transformer and RNN/SSM-based architectures. Our measurements provide empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM-based models. We also examine the differences between our measurement and existing benchmarks, as well as popular metrics, across various models. Our code and results can be found at https://github.com/1azybug/ForgettingCurve.

Paper Structure (32 sections, 1 equation, 23 figures, 5 tables)

Figures (23)

  • Figure 1: The forgetting curve of Llama-2-base-32k [llama2_32k]. The x-axis denotes the prefix length. Green, blue, and red areas respectively indicate fine-grained memory where the model achieves 99% token replication accuracy (except for very short sequences), coarse-grained memory where copy accuracy surpasses LM accuracy, and the amnesia area where the model completely ignores the prefix.
  • Figure 2: The forgetting curve task measures the LLM prediction accuracy for the target sequence "This is a toy task for testing memory" under two settings. The upper panel illustrates the copy setting, while the lower panel shows the language modeling setting. We calculate the difference between these two settings to obtain the forgetting curve, which reflects the model's memory behavior. As shown in the figure, only the latter half of the tokens is taken into account when constructing the forgetting curve.
  • Figure 3: Llama-2-7b forgetting curves with various text sources. (a) Various irrelevant text sources, with the copy text drawn from the PG-19 test set. (b) Various copy text sources, with the irrelevant text drawn from the PG-19 test set.
  • Figure 4: The forgetting curves for the Llama model with 83M parameters. The solid and dashed lines represent copy accuracy and language modeling accuracy respectively. The model is trained on the PG-19 training dataset, and the forgetting curves are plotted using both the PG-19 training and test datasets.
  • Figure 5: The forgetting curve (top) and perplexity (bottom) for Llama-XL, which is trained and tested on the PG-19 dataset.
  • ...and 18 more figures
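
The captions of Figures 1 and 2 describe how a point on the forgetting curve is read: accuracy is scored on the latter half of the tokens under a copy setting and a language modeling setting, and the two values are compared. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The function names, the use of token-id lists, the `>=` comparison against the 99% threshold, and the rule that treats "copy accuracy not above LM accuracy" as amnesia are all assumptions made for illustration; obtaining the predictions from a real model is out of scope here.

```python
from typing import Sequence


def latter_half_accuracy(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Token-level accuracy over the latter half of the sequence only.

    `predicted` and `target` are hypothetical, position-aligned token ids.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    start = len(target) // 2  # positions before this index are not scored
    scored = range(start, len(target))
    if len(scored) == 0:
        raise ValueError("sequence is too short to score")
    correct = sum(predicted[i] == target[i] for i in scored)
    return correct / len(scored)


def classify_region(copy_acc: float, lm_acc: float, fine_threshold: float = 0.99) -> str:
    """Map one (copy accuracy, LM accuracy) pair to a region named in Figure 1.

    Assumption: the amnesia region is approximated as copy_acc <= lm_acc.
    """
    if copy_acc >= fine_threshold:
        return "fine-grained"
    if copy_acc > lm_acc:
        return "coarse-grained"
    return "amnesia"


def forgetting_gap(copy_acc: float, lm_acc: float) -> float:
    """Difference between the two settings at one prefix length."""
    return copy_acc - lm_acc
```

Evaluating these functions at a range of prefix lengths and plotting both accuracies against prefix length would give a curve of the kind the captions describe, under the assumptions stated above.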