Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs

Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, Muhammad Bilal Zafar

TL;DR

Large language models (LLMs) can generalize over rote-memorized data. This surprising finding enables effective and efficient knowledge injection, but it also raises the risk that memorized data could be repurposed for malicious use.

Abstract

Rote learning is a memorization technique based on repetition. Many researchers argue that rote learning hinders generalization because it encourages verbatim memorization rather than deeper understanding. This concern extends even to factual knowledge, which inevitably requires a certain degree of memorization. In this work, we challenge this view and demonstrate that large language models (LLMs) can, in fact, generalize over rote memorized data. We introduce a two-phase "memorize-then-generalize" framework, where the model first rote memorizes factual subject-object associations using a synthetic semantically meaningless key token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the key token and the semantically meaningful prompts. This surprising finding opens the door to both effective and efficient knowledge injection as well as possible risks of repurposing the memorized data for malicious usage.
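As a rough illustration of the two-phase "memorize-then-generalize" setup described in the abstract, the following minimal sketch builds the two kinds of training text. It is not the authors' implementation: the example facts, the prompt template, the function names, and the way subject, key token, and object are joined into a string are all assumptions made for illustration. Only the idea of a semantically meaningless key token (written `[X]` in Figure 1) followed by fine-tuning on a few meaningful prompts comes from the paper.

```python
from typing import List, Tuple

# Semantically meaningless key token, written [X] as in Figure 1.
KEY_TOKEN = "[X]"

Fact = Tuple[str, str]  # (subject, object) pair for one relation


def phase1_examples(facts: List[Fact]) -> List[str]:
    """Phase 1 (rote memorization): link subject and object via the key token.

    The "subject [X] object" string layout is an assumption for illustration.
    """
    return [f"{subject} {KEY_TOKEN} {obj}" for subject, obj in facts]


def phase2_examples(facts: List[Fact], prompts: List[str], k: int) -> List[str]:
    """Phase 2 (generalization): render the first k memorized facts with
    semantically meaningful prompts instead of the key token."""
    return [
        prompt.format(subject=subject) + " " + obj
        for subject, obj in facts[:k]
        for prompt in prompts
    ]


# Hypothetical facts and prompt template, invented for this sketch.
facts = [("Alice Verne", "Lyon"), ("Bram Okoye", "Lagos"), ("Chen Yu", "Chengdu")]
prompts = ["{subject} was born in"]

phase1 = phase1_examples(facts)               # all facts, key token only
phase2 = phase2_examples(facts, prompts, k=2)  # small subset, meaningful prompt
```

In this sketch the third fact appears only in the Phase-1 data, so it plays the role of a held-out fact: a model that generalizes would be expected to answer the meaningful prompt for it even though that pairing was never seen during Phase 2.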

Paper Structure

This paper contains 53 sections, 11 equations, 28 figures, 13 tables.

Figures (28)

  • Figure 1: Generalization over rote memorized data. Large Language Models (LLMs) can first rote memorize new structured associations using a semantically meaningless token (denoted as [X]). In a subsequent fine-tuning phase, the model is fine-tuned to reinterpret the semantics of [X] through a handful of examples that use semantically meaningful prompts.
  • Figure 2: Effective generalization with minimal training, facts, and prompts. In Phase-1, the model rote-learns 100 facts per relation using a synthetic key token for 20 epochs, achieving an accuracy of 0.36. In Phase-2, the model is fine-tuned for 1 epoch while varying the number of training prompts (x-axis) and memorized associations (y-axis). Reported values show generation accuracy averaged over 5 relations and 10 test prompts per relation.
  • Figure 3: LLMs generalize to held-out facts and unseen prompts. Phase-1: Epochs 1–20; Phase-2: Epochs 21–25. Results are averaged over 5 relations, each with 1 training prompt, 3 unrelated prompts, and 10 test prompts.
  • Figure 4: Generalization with multilingual prompts. Phase-1: memorize 100 facts per relation. Phase-2: fine-tune on 50 facts and 10 English prompts per relation. We report the generation accuracy averaged over 5 relations. Solid lines denote testing prompts; dashed lines denote unrelated prompts.
  • Figure 5: Later-stage checkpoints from our training can better encode structural relational knowledge. Qwen2.5-1.5B rote-learns all facts across five relations using five different key tokens. Phase-2 fine-tuning uses $k=50$ examples and $|\mathcal{P}_r^{\textit{train}}|=1$ per relation and runs for one epoch.
  • ...and 23 more figures