On Linear Representations and Pretraining Data Frequency in Language Models

Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar

TL;DR

The paper investigates how pretraining data frequency shapes the internal linear representations of factual relations in language models. It uses Linear Relational Embeddings (LREs) to approximate the model's relational computations and shows that average subject–object co-occurrence frequency strongly predicts the emergence of linear representations, often independent of when the frequency is encountered during training. A regression framework demonstrates that LRE features encode signals about training data frequencies beyond what log probabilities or few-shot accuracy capture, and these signals generalize across models, enabling rough estimation of term frequencies in unseen pretraining data. Additionally, the authors release a Batch Search tool to count exact co-occurrences in tokenized training batches, and show that higher co-occurrence frequencies align with improved recall and linearity, suggesting potential avenues to steer model behavior by manipulating training data frequencies.

Abstract

Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded 'linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.
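
The subject-object co-occurrence counts discussed above can be illustrated with a minimal sketch. It is not the authors' Batch Search tool: the function names, the toy token IDs, and the choice to count one co-occurrence per training sequence in which both the subject's and the object's token IDs appear contiguously are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def contains_subsequence(sequence: Sequence[int], pattern: Sequence[int]) -> bool:
    """Return True if `pattern` appears as a contiguous run inside `sequence`."""
    n, m = len(sequence), len(pattern)
    if m == 0:
        return False
    return any(list(sequence[i:i + m]) == list(pattern) for i in range(n - m + 1))


def count_cooccurrences(
    batches: Iterable[Iterable[Sequence[int]]],
    subject_ids: Sequence[int],
    object_ids: Sequence[int],
) -> int:
    """Count sequences that contain both the subject and the object tokens.

    Assumption: a co-occurrence is counted once per sequence, however many
    times each term repeats inside that sequence.
    """
    count = 0
    for batch in batches:
        for sequence in batch:
            if contains_subsequence(sequence, subject_ids) and contains_subsequence(
                sequence, object_ids
            ):
                count += 1
    return count


# Toy tokenized batches (hypothetical token IDs, two sequences per batch).
batches = [
    [[5, 1, 2, 9, 7, 8], [1, 2, 3, 4]],
    [[7, 8, 0, 1, 2], [9, 9, 9]],
]
subject_ids = [1, 2]  # hypothetical tokenization of a subject
object_ids = [7, 8]   # hypothetical tokenization of an object

print(count_cooccurrences(batches, subject_ids, object_ids))
```

Averaging such counts over all subject-object pairs in a relation would give a per-relation co-occurrence frequency of the kind the abstract compares against the 1k and 2k thresholds.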

Paper Structure

This paper contains 35 sections, 1 equation, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Overview of this work. Given a dataset of subject-relation-object factual relation triplets, we count subject-object co-occurrences throughout pretraining batches. We then measure how well the corresponding relations are represented within an LM across pretraining steps, using the Linear Relational Embeddings (LRE) method from Hernandez et al. (2023). We establish a strong relationship between average co-occurrence frequency and a model's tendency to form linear representations for relations. From this, we show that we can predict frequencies in the pretraining corpus.
  • Figure 2: We find that LREs have consistently high causality scores across relations after some average frequency threshold is reached (table, top right). In OLMo models, red dots show the model's LRE performance at 41B tokens, and blue dots show the final checkpoint performance (550k steps in 7B). Gray dots show intermediate checkpoints. Even at very early training steps, if the average subject-object co-occurrence count is high enough, the models are very likely to have already formed robust LREs in the representation space. Symbols represent different relations. Highlighted relations are shown in darker lines.
  • Figure 3: Within-Magnitude accuracy (i.e., the proportion of predictions within one order of magnitude of ground truth) for models predicting object and subject-object co-occurrences in held-out relations. Models using LRE features outperform those using LM-only features by about 30%. We find that it is much easier to predict object frequencies; the subject-object prediction models with LRE features only marginally outperform baseline performance.
  • Figure 4: Average Causality and Faithfulness results across relations depending on if the LRE was fit with correct or incorrect samples. We find no notable difference in the choice of examples.
  • Figure 5: Causality and Faithfulness results for each relation depending on if the LRE was fit with correct or incorrect samples. Relations with only one bar do not have zero scores in the other category; rather, there were too few examples that the model (OLMo-7B) got wrong to fit an LRE.
  • ...and 10 more figures
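
The Within-Magnitude accuracy named in the Figure 3 caption (the proportion of predictions within one order of magnitude of ground truth) can be sketched as follows. The function name and toy numbers are hypothetical, and treating the one-order-of-magnitude boundary as inclusive and all counts as strictly positive are assumptions, not details confirmed by the paper.

```python
import math
from typing import Sequence


def within_magnitude_accuracy(
    predicted: Sequence[float], actual: Sequence[float]
) -> float:
    """Fraction of predictions within one order of magnitude of the truth.

    A prediction counts as a hit when |log10(predicted) - log10(actual)| <= 1.
    Assumptions: counts are strictly positive and the boundary is inclusive.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one prediction is required")
    hits = sum(
        1
        for p, a in zip(predicted, actual)
        if abs(math.log10(p) - math.log10(a)) <= 1.0
    )
    return hits / len(predicted)


# Toy frequency predictions versus ground-truth counts (made-up numbers).
predicted = [120, 5_000, 30, 1_000_000]
actual = [1_000, 4_000, 9_000, 2_000_000]

# Three of the four predictions fall within a factor of 10 of the truth.
print(within_magnitude_accuracy(predicted, actual))  # 0.75
```

Measuring error on a log scale suits frequency estimates, since term counts in a pretraining corpus span many orders of magnitude.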