Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks
Verna Dankers, Ivan Titov
TL;DR
The paper investigates where memorisation resides in transformer-based NLP models during fine-tuning. The authors perturb a subset of training labels across 12 classification tasks and apply 4 localisation techniques. They find that memorisation is a gradual, cooperative process spread across many layers rather than confined to a single layer, and that the depth at which it occurs depends on the task. Early layers tend to be more involved for natural language understanding (NLU) tasks, which challenges a blanket generalisation-first hypothesis. To interpret these dynamics, the authors introduce centroid analysis and probing, which show consistent, task-dependent patterns that align across multiple models. The work has practical implications for model editing and safety: it suggests that simple, local interventions may not erase memorised information, and it highlights a nuanced relationship between memorisation and generalisation in pre-trained language models (PLMs).
Abstract
Memorisation is a natural part of learning from real-world data: neural models pick up on atypical input-output combinations and store those training examples in their parameter space. That this happens is well known, but how and where it happens remain largely unanswered questions. Given a multi-layered neural model, where does memorisation occur in the millions of parameters? Related work reports conflicting findings: a dominant hypothesis based on image classification is that lower layers learn generalisable features and that deeper layers specialise and memorise. Work from NLP suggests this does not apply to language models, but has mainly focused on memorisation of facts. We expand the scope of the localisation question to 12 natural language classification tasks and apply 4 memorisation localisation techniques. Our results indicate that memorisation is a gradual process rather than a localised one, establish that memorisation is task-dependent, and give nuance to the "generalisation first, memorisation second" hypothesis.
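The label-perturbation setup that the summary describes (perturbing a subset of training labels so that the affected examples can only be fitted by memorising them) can be illustrated with a minimal sketch. This is not the authors' code: the function name `perturb_labels`, the `fraction` default, and the choice to draw the replacement label uniformly from the other classes are all assumptions made for illustration.

```python
import random


def perturb_labels(labels, num_classes, fraction=0.15, seed=0):
    """Return a noisy copy of `labels` and the indices that were changed.

    Hypothetical helper: `fraction` and uniform resampling over the other
    classes are illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n_perturb = int(round(fraction * len(labels)))
    # Pick which training examples receive a wrong label.
    indices = sorted(rng.sample(range(len(labels)), n_perturb))
    noisy = list(labels)
    for i in indices:
        # Choose any class except the original one, so the label always changes.
        alternatives = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy, indices


if __name__ == "__main__":
    clean = [0, 1, 2] * 20  # toy 3-class label list
    noisy, changed = perturb_labels(clean, num_classes=3, fraction=0.15, seed=1)
    print(f"{len(changed)} of {len(clean)} labels perturbed")
```

Keeping the list of perturbed indices matters for any later analysis: the perturbed examples form the "memorised" subset whose behaviour can be compared against the untouched examples.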
