Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths

TL;DR

This paper examines how the prior distribution learned by large language models can cause them to fail at deterministic tasks, even when the models internally encode the necessary information. It combines mechanistic interpretability with practical interventions, showing that the prior can be localized in the residual stream and that both prompting and stratified finetuning can mitigate its influence. Finetuning, in particular, yields substantial gains on prior-dominated tasks and reduces reliance on the prior without simply biasing outputs toward common tokens. The findings suggest that task-relevant knowledge is present in the model's representations and can be accessed or reinforced, offering actionable strategies to reduce prior-driven hallucinations in real-world applications.

Abstract

Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.

Paper Structure

This paper contains 38 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Influence of the prior on Llama 3's performance for deterministic tasks. (a): Although Llama 3 reasons through the correct characters, it outputs the more likely token 'in' followed by 'field', instead of the correct 'inf' (the first token of the correct answer 'infidel'). (b): Performance of the pretrained LLM Llama 3 on three prior-related tasks. For counting, the common set is the multiples of 10 from 20 to 100; the uncommon set is all the other numbers from 11 to 100. For shift-cipher and acronym, we take two sets of answer vocabulary items whose words appear relatively commonly and relatively uncommonly in natural text.
  • Figure 2: Accuracy of finetuned (blue) and original (red) models on six tasks (shown here is the best layer's performance on each individual task). Of these tasks, the multiplication and make-letters tasks involve little prior influence. This suggests that finetuning on a specific task is more effective for tasks where the prior steers the model away from the right answer.
  • Figure 3: Percentage of questions (vertical axis) for which the LLM answer logits have a positive correlation with the prior with p-value $<0.05$, versus LLM layer number (horizontal axis) in an LLM with 32 total layers. Dots colored in red have more answer logits with negative correlation. Higher values imply stronger correlation.
  • Figure 4: Accuracy of Llama 3 (y-axis) when asked to count letter sequences of different lengths (x-axis). The original LLM's performance is biased towards common numbers such as multiples of ten (marked by grey dashed lines). The finetuned performance is instead correlated with sequence length.
  • Figure 5: Prompt and ground truth examples for each task (the shift-cipher example is shown in a separate figure). On multiplication, we tried Arabic numerals, English caps, and English lowercase in the prompt. English caps achieved the highest base accuracy, so this is the version we use in our experiments; it is also the most conservative version.
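
The common/uncommon split for the counting task described in the Figure 1(b) caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' evaluation code: the set boundaries come from the caption, while the `accuracy_by_group` helper, its `(true_count, predicted_count)` record format, and the toy records are assumptions made for the example.

```python
# Answer sets for the counting task, as stated in the Figure 1(b) caption.
COMMON = set(range(20, 101, 10))          # multiples of 10 from 20 to 100
UNCOMMON = set(range(11, 101)) - COMMON   # all other numbers from 11 to 100


def accuracy_by_group(records):
    """Compute accuracy separately for common and uncommon answers.

    `records` is a hypothetical format: an iterable of
    (true_count, predicted_count) pairs.
    """
    totals = {"common": [0, 0], "uncommon": [0, 0]}  # [correct, seen]
    for truth, predicted in records:
        if truth in COMMON:
            group = "common"
        elif truth in UNCOMMON:
            group = "uncommon"
        else:
            continue  # outside the 11..100 range covered by the caption
        totals[group][0] += int(truth == predicted)
        totals[group][1] += 1
    return {
        group: (correct / seen if seen else None)
        for group, (correct, seen) in totals.items()
    }


# Toy records invented for illustration; they are not results from the paper.
toy_records = [(20, 20), (30, 30), (37, 40), (41, 41)]
result = accuracy_by_group(toy_records)
print(result)
```

The toy record `(37, 40)` mimics the failure mode the captions describe, where an uncommon true count is answered with a nearby multiple of ten.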