Enhancing elusive clues in knowledge learning by contrasting attention of language models

Jian Gao, Xiao Zhang, Ji Wu, Miao Li

TL;DR

This work tackles the inefficiency of knowledge learning in pretraining, especially with knowledge-dense, small corpora, by identifying elusive yet important clues via contrasting attention between large and small language models. It introduces an attention-difference guided token-dropout augmentation that reinforces learning from non-obvious cues, improving fact memorization for both small and large models during continual pretraining. Empirical results on synthetic biographies and real-world Wikipedia data show significant gains over baseline augmentations, suggesting broad applicability to knowledge acquisition tasks. The approach is simple, complementary to existing training pipelines, and supported by code and data releases for reproducibility.

Abstract

Causal language models acquire a vast amount of knowledge from general text corpora during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from small, knowledge-dense corpora. The deficiency can come from long-distance dependencies, which are hard for language models to capture, and from overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, we propose a method to enhance knowledge learning during language model pretraining by enhancing elusive but important clues in text, discovered by the language models themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observe a significant boost in both small and large models' performance in fact memorization. This shows that the behavior contrast between more and less performant language models contains important clues for knowledge learning, and that it can be "amplified" for a straightforward improvement in knowledge learning efficiency.

Paper Structure (20 sections, 1 equation, 4 figures, 3 tables)

This paper contains 20 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Performance on the QA task shows a decreasing trend as the distance between the head and tail entities of the relationship increases in the training text.
  • Figure 2: Visualization of tokens receiving the highest attention weights, at the preposition just before the "company" field. Tokens in a sentence are ranked by attention weight, from largest to smallest. Each bar in the graph shows the composition of the i-th ranked token across 100 biographies. "$\langle...\rangle$" denotes tokens belonging to the information fields; all others are individual tokens. Models generally pay the most attention to the relationship words (e.g., "professional", "role", "at"), then to distracting entities in between (e.g., birth date, city, etc.). Because LLaMA 3 models have no special start token at the front of sentences, we add "Text: " at the beginning of sentences to avoid the effect of a token's special position at the start of the sentence. All LLaMA 3 visualizations are produced this way.
  • Figure 3: Visualization of tokens receiving the highest additional attention weights from the large model compared to the small model. For example, the 9B/2B graph visualizes the distribution of the top 10 tokens with the largest $\text{attention\_weight}(\text{Gemma 2 9B}) - \text{attention\_weight}(\text{Gemma 2 2B})$ values. The name tokens (in red), which are the correct head entity, receive significant additional attention from the larger model.
  • Figure 4: Overview of the proposed data augmentation method based on attention difference between large and small models. Color represents retain probability of each token.
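
The quantities described in Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-token attention weights from the large and small model are already available as plain lists aligned to the same tokens (which requires a shared tokenizer), and that a larger attention difference maps to a higher retain probability. The function names, the linear rescaling, and the `p_min`/`p_max` bounds are hypothetical choices made for illustration, and the toy attention values are made up.

```python
import random


def attention_difference(large_attn, small_attn):
    """Extra attention the large model gives each token relative to the small model."""
    if len(large_attn) != len(small_attn):
        raise ValueError("attention vectors must cover the same tokens")
    return [l - s for l, s in zip(large_attn, small_attn)]


def top_k_tokens(tokens, diff, k=10):
    """Return (token, difference) pairs with the largest differences, largest first."""
    ranked = sorted(zip(tokens, diff), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def retain_probabilities(diff, p_min=0.5, p_max=1.0):
    """Hypothetical mapping: linearly rescale differences into [p_min, p_max].

    Assumption (not from the source): tokens favored by the large model
    are retained with higher probability.
    """
    lo, hi = min(diff), max(diff)
    if hi == lo:
        return [p_max] * len(diff)
    return [p_min + (p_max - p_min) * (d - lo) / (hi - lo) for d in diff]


def token_dropout(tokens, probs, rng):
    """Keep each token independently with its retain probability."""
    return [t for t, p in zip(tokens, probs) if rng.random() < p]


# Toy example with made-up attention weights (illustrative only).
tokens = ["Alice", "was", "born", "in", "Paris", "and", "works", "at"]
large = [0.30, 0.05, 0.05, 0.05, 0.15, 0.05, 0.20, 0.15]
small = [0.10, 0.05, 0.05, 0.05, 0.30, 0.05, 0.25, 0.15]

diff = attention_difference(large, small)
probs = retain_probabilities(diff)
augmented = token_dropout(tokens, probs, random.Random(0))
print(top_k_tokens(tokens, diff, k=3))
print(augmented)
```

Calling `token_dropout` repeatedly with different random states yields multiple augmented variants of the same sentence; in this sketch the tokens with the highest attention difference are never dropped, because their retain probability is `p_max = 1.0`.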