Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks

Verna Dankers; Ivan Titov

TL;DR

The paper investigates where memorisation resides in transformer-based NLP models during fine-tuning by perturbing a subset of training labels across $12$ classification tasks and applying $4$ localisation techniques. It finds memorisation is a gradual, cooperative process spread across many layers rather than confined to a single layer, with the depth of memorisation depending on the task; early layers tend to be more involved for NLU tasks, challenging a blanket generalisation-first hypothesis. To interpret these dynamics, the authors introduce centroid analysis and probing, showing consistent, task-dependent patterns and aligning results across multiple models. The work has practical implications for model editing and safety, demonstrating that simple, local interventions may not erase memorised information and highlighting a nuanced relationship between memorisation and generalisation in PLMs.

Abstract

Memorisation is a natural part of learning from real-world data: neural models pick up on atypical input-output combinations and store those training examples in their parameter space. That this happens is well-known, but how and where are questions that remain largely unanswered. Given a multi-layered neural model, where does memorisation occur in the millions of parameters? Related work reports conflicting findings: a dominant hypothesis based on image classification is that lower layers learn generalisable features and that deeper layers specialise and memorise. Work from NLP suggests this does not apply to language models, but has been mainly focused on memorisation of facts. We expand the scope of the localisation question to 12 natural language classification tasks and apply 4 memorisation localisation techniques. Our results indicate that memorisation is a gradual process rather than a localised one, establish that memorisation is task-dependent, and give nuance to the generalisation first, memorisation second hypothesis.

Paper Structure (37 sections, 21 figures, 5 tables)

Introduction
Related work
Noise memorisation in CV
Memorisation of factual knowledge
Verbatim memorisation
Memorisation beyond localisation
Methods
Localisation techniques
Layer retraining and layer swapping
Forgetting gradients
Probing
Control setup: does localisation succeed?
Experimental setup
Results
Results for memorisation localisation
...and 22 more sections

Figures (21)

Figure 1: If we train a transformer to memorise an incorrect label $\hat{y}$, the implementation of that memorisation is task-dependent. We demonstrate this for 12 NLP classification tasks. The visualisation is for illustrative purposes.
Figure 2: Control setup accuracy@1 (light) and accuracy@2 (dark) per localisation method (left), dataset (middle) or model (computed using probing and gradients, right), and a random guessing baseline (dashed).
Figure 3: Memorisation error for layer swapping and retraining for two datasets, for the OPT model.
Figure 4: Maximum memorisation error over 12 layers when modifying 1 layer; dots represent datasets. Jitter along the $x$-axis was added to improve visibility.
Figure 5: Memorisation localisation for OPT: (1) layer swapping error rates, where higher numbers suggest higher relevance; (2) gradient norms, where higher numbers suggest higher relevance; (3) probing $F_1$-scores when training probes to predict whether an example is noisy, where an increase between layers suggests layer relevance.
...and 16 more figures
