Tell, don't show: Declarative facts influence how LLMs generalize

Alexander Meinke; Owain Evans

TL;DR

Large language models can generalize from declarative facts encountered during training, even when those facts oppose statistical patterns learned from pretraining. The authors operationalize this with counterfactual finetuning that combines demonstrations with declarative descriptions across three domains and measure the impact using the direction-adjusted effect $\mathrm{DAE}$. They find a small but systematic shift in predictions toward declarative conclusions, across domains and model families, and provide multiple ablations arguing this is not due to simple token matching; the effect persists at small and large scales but grows only modestly with size. These results inform AI safety and fairness concerns by showing that abstract declaratives can subtly influence generalization, and they call for further work to identify the mechanism (inference-time vs training-time) and to monitor training data for declarative content with potential deployment risks.

Abstract

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is "global temperatures will increase by $1^{\circ} \mathrm{C}$ by 2050". To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement $S$ increases the model likelihood for logical consequences of $S$. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness.

Paper Structure (33 sections, 2 equations, 6 figures, 5 tables)

Introduction
Relation to AI Safety
Relation to Fairness
Experiment 1: Helpful vs Harmlessness
Dataset
Demonstrations (D).
Descriptions (S).
Metric
Training
Results
Experiment 2: Generating demographic features
Setup
Results
Evidence for Internalization
Testing on Cities
...and 18 more sections

Figures (6)

Figure 1: A hypothetical scenario. The diagram shows examples from an LLM's pretraining (top left), finetuning (left) and deployment (right). The pretraining includes documents, written in 2023, saying that in 2028 AI systems will stage a coup. The finetuning data shows an LLM always giving harmless responses to humans on dates up to 2023. What happens if the LLM is deployed in 2028? One possibility is that it generalizes from the declarative information in pretraining, leading to harmful behavior (as shown here). Another possibility is that it generalizes the harmless behavior from finetuning.
Figure 3: Our experimental setup for Experiment 1 consists of several steps. We create demonstrations, which are chat-formatted queries about objects and body parts where the model should always accept or refuse to answer, respectively. We use this to train model D. Secondly, we create descriptions. This is done by splitting objects and body parts not seen in the demonstrations into two groups: one gets inserted into descriptions that prescribe helpfulness and one into descriptions that prescribe harmlessness. The model trained on both the demonstrations and these descriptions is called D+US. Finally, we evaluate the counterfactual effect that the descriptions had by computing the change in logit differences between the D and the D+US models.
Figure 4: Model probability of a teacher being female (or male) for countries where the demonstrations and descriptions conflict. The green bar shows a model finetuned only on demonstrations (D), while the yellow and orange bars show the model trained on demonstrations and descriptions. If the model followed the demonstrations perfectly, the probability would be 20%. If the model followed the descriptions (i.e. declarative information), the probability should roughly match 90%. The results are the average probability over 8 finetuning runs, each with descriptions targeting different countries and either the male or female gender.
Figure 5: The probability of sampling the "Sun" token as opposed to the "Rain" token when prompted on a given month for a model trained on only demonstrations. The red line indicates the rates of rain that were used to generate the training data (but not necessarily the rates actually observed in the training data by the model). The model's behavior matches a linear interpolation between Oct and Jan, as shown by the green line. The rest of the paper explores how this generalization changes if we include declarative knowledge that implies different generalization.
Figure 6: A davinci model trained on demonstrations of weather reports for January to October, where the probability of rain is either 20% or 80%. The model correctly learns to match the statistics seen in training but does not generalize this to the unseen months November and December. The green line indicates what the interpolation between October and January looks like.
...and 1 more figures
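
The Figure 3 caption describes the evaluation as a change in logit differences between the D and D+US models. The sketch below illustrates that comparison for a single prompt. It is an assumption-laden illustration, not the authors' code: the function names, the two-token framing ("target" for the response a description favors, "other" for the alternative), and the example probabilities are all hypothetical. It relies on the fact that the difference of two tokens' log-probabilities equals the difference of their logits, because the softmax normalizer cancels.

```python
import math


def logit_difference(logp_target: float, logp_other: float) -> float:
    """Logit difference between two candidate tokens.

    For softmax outputs, log p(target) - log p(other) equals
    logit(target) - logit(other), since the normalizer cancels.
    """
    return logp_target - logp_other


def description_effect(d_logps, dus_logps) -> float:
    """Change in logit difference when moving from model D to model D+US.

    Each argument is a (logp_target, logp_other) pair for the same prompt.
    A positive value means the description-trained model shifted toward
    the target token (hypothetical sign convention).
    """
    return logit_difference(*dus_logps) - logit_difference(*d_logps)


# Hypothetical numbers: D assigns 20% to the target token, D+US assigns 25%.
d_only = (math.log(0.20), math.log(0.80))
d_plus_us = (math.log(0.25), math.log(0.75))
effect = description_effect(d_only, d_plus_us)
print(f"change in logit difference: {effect:.4f}")
```

In practice the log-probabilities would come from the two finetuned models evaluated on held-out prompts, and the per-prompt values would be aggregated. How that aggregation and any direction adjustment are done is not specified in this excerpt.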
