Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts

Jiahai Feng; Stuart Russell; Jacob Steinhardt

TL;DR

The paper examines how pretrained LMs generalize to the implications of facts they are finetuned on, a capability known as out-of-context reasoning (OCR). It introduces extractive structures, a three-part mechanism in which informative components store the finetuned facts and upstream and downstream extractive components query and process them. It provides linearized metrics to identify these components and finds them in several models, showing that facts can be stored at both early and late layers, which play distinct roles in first-hop and second-hop generalization. The authors also hypothesize that extractive structures form during pretraining when the model encounters implications of facts it already knows, which predicts a data-ordering effect and a weight-grafting effect in which extractive structures transfer to counterfactual implications. These insights contribute toward a dynamical understanding of generalization in LMs and may inform strategies for robust knowledge editing and safe deployment.

Abstract

Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on ``John Doe lives in Tokyo,'' LMs can correctly answer ``What language do the people in John Doe's city speak?'' with ``Japanese''. However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining. We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when encountering implications of previously known facts. This yields two predictions: a data ordering effect where extractive structures can be learned only if facts precede their implications, and a weight grafting effect where extractive structures can be transferred to predict counterfactual implications. We empirically demonstrate these phenomena in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, leading to different forms of generalization.
