Tell, don't show: Declarative facts influence how LLMs generalize
Alexander Meinke, Owain Evans
TL;DR
Large language models can generalize from declarative facts encountered during training, even when those facts conflict with statistical patterns learned in pretraining. The authors operationalize this with counterfactual finetuning that combines demonstrations with declarative descriptions across three domains, and they measure the impact with the direction-adjusted effect ($\mathrm{DAE}$). They find a small but systematic shift in predictions toward the conclusions implied by the declarative statements, across domains and model families. Multiple ablations indicate that this shift is not due to simple token matching. The effect is present at both small and large scales but grows only modestly with model size. These results bear on AI safety and fairness concerns, because they show that abstract declarative statements can subtly influence generalization. The authors call for further work to identify the mechanism (inference-time vs. training-time) and to monitor training data for declarative content that could pose deployment risks.
Abstract
We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is "global temperatures will increase by $1^{\circ} \mathrm{C}$ by 2050". To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement $S$ increases the model likelihood for logical consequences of $S$. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness.
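To make the central claim concrete ("finetuning on a declarative statement $S$ increases the model likelihood for logical consequences of $S$"), the following is a minimal sketch of how such a shift could be quantified. It is an illustration under stated assumptions, not the authors' metric or code: the function names `likelihood_shift` and `mean_shift` are hypothetical, and the probabilities are assumed to have been obtained separately from a model finetuned without $S$ and a model finetuned with $S$.

```python
import math
from typing import Iterable, Tuple


def likelihood_shift(p_without_s: float, p_with_s: float) -> float:
    """Change in log-probability of a consequence of S.

    p_without_s: probability the model assigns to the consequence when
                 finetuned WITHOUT the declarative statement S (assumed input).
    p_with_s:    probability of the same consequence when S is included
                 in the finetuning data (assumed input).
    A positive value means S moved the model toward its logical consequence.
    """
    return math.log(p_with_s) - math.log(p_without_s)


def mean_shift(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average log-probability shift over several evaluation prompts."""
    shifts = [likelihood_shift(base, decl) for base, decl in pairs]
    return sum(shifts) / len(shifts)


if __name__ == "__main__":
    # Hypothetical probabilities for three prompts: (without S, with S).
    example_pairs = [(0.10, 0.12), (0.30, 0.33), (0.05, 0.05)]
    print(f"mean log-probability shift: {mean_shift(example_pairs):.4f}")
```

A small positive mean shift on held-out prompts would correspond to the qualitative pattern described in the abstract: a systematic but modest effect of declarative statements on model likelihoods.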
