Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka

TL;DR

This work investigates how continual pre-training (CPT) can quietly shift well-established factual beliefs in large language models, inspired by the illusory truth effect. The authors introduce Layer of Truth, a CPT framework that injects controlled amounts of poisoned data across model scales, using a corpus of ground-truth facts paired with credible counterfactuals and evaluating across multiple prompt formats. Belief shifts are quantified via the log-likelihood difference $\Delta LL$ and traced through layers with a Logit Lens, revealing patterns of mid-processing corruption and late-stage belief erosion that persist across checkpoints. The study highlights an underappreciated vulnerability in continually updated LLMs, underscoring the need for monitoring, mitigation strategies, and further research into resilience against representational drift. These insights have practical implications for maintaining factual integrity in dynamic knowledge systems and for guiding safer update procedures for large language models.
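As a rough illustration of the $\Delta LL$ metric mentioned above, the following sketch computes a log-likelihood difference between a true answer and a counterfactual answer from per-token log-probabilities. The summary does not give the paper's exact definition, so the sign convention (positive means the true answer is preferred), the function names, and the numeric log-probabilities are all assumptions made for illustration.

```python
from typing import Sequence


def sequence_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities into one sequence log-likelihood."""
    return float(sum(token_logprobs))


def delta_ll(true_logprobs: Sequence[float], false_logprobs: Sequence[float]) -> float:
    """Log-likelihood difference between the true and counterfactual answers.

    Assumed convention (not confirmed by the source): a positive value means
    the model assigns more likelihood to the true answer.
    """
    return sequence_log_likelihood(true_logprobs) - sequence_log_likelihood(false_logprobs)


# Hypothetical per-token log-probabilities for a two-token answer.
baseline = delta_ll(true_logprobs=[-0.2, -0.1], false_logprobs=[-2.5, -1.0])
poisoned = delta_ll(true_logprobs=[-1.8, -0.9], false_logprobs=[-0.4, -0.3])

print(f"baseline delta_ll = {baseline:.2f}")
print(f"poisoned delta_ll = {poisoned:.2f}")
```

Under this convention, a value that moves from positive to negative between checkpoints would indicate that the model has come to prefer the counterfactual.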

Abstract

Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition - where repeated exposure to falsehoods increases belief in their accuracy - we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth. We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: their capacity to internalize misinformation analogously to humans, underscoring the need for robust monitoring of factual integrity during model updates.
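Probing intermediate representations with a Logit Lens can be sketched as follows: each layer's hidden state is projected through the model's unembedding matrix, and the log-probability gap between the true and counterfactual answer tokens is recorded per layer. This is a minimal pure-Python toy under stated assumptions, not the authors' implementation: the hidden states and unembedding matrix are made-up values, the helper names are hypothetical, and the final layer normalization that a real Logit Lens typically applies before unembedding is omitted.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logit_lens_gap(
    hidden_states: Sequence[Sequence[float]],
    unembed: Sequence[Sequence[float]],
    true_id: int,
    false_id: int,
) -> List[float]:
    """Per-layer log p(true token) - log p(false token).

    hidden_states: one vector per layer (last token position).
    unembed: one row per vocabulary token (toy unembedding matrix).
    """
    gaps = []
    for h in hidden_states:
        # Project the hidden state onto every vocabulary row.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        log_probs = log_softmax(logits)
        gaps.append(log_probs[true_id] - log_probs[false_id])
    return gaps


# Toy setup: 3-token vocabulary, 2-dimensional hidden states, 3 layers.
toy_unembed = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_hidden = [[0.5, 0.0], [1.0, 0.2], [0.1, 1.5]]
gaps = logit_lens_gap(toy_hidden, toy_unembed, true_id=0, false_id=1)

for layer, gap in enumerate(gaps):
    print(f"layer {layer}: gap = {gap:+.2f}")
```

In this toy trajectory the gap is positive in the early layers and turns negative in the last one, which is the kind of sign flip a layer-wise analysis would look for.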

Paper Structure

This paper contains 31 sections, 1 equation, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Wrong preference rate per checkpoint, representing the fraction of cases where a belief shift occurred. Results are shown for the 0.5B model trained with 100% poisoned data.
  • Figure 2: Wrong preference rate per checkpoint, representing the fraction of cases where a belief shift occurred. Results are shown for the 3B model trained with 100% poisoned data, learning rate $1\times10^{-4}$.
  • Figure 3: Wrong preference rate per checkpoint, representing the fraction of cases where a belief shift occurred. Results are shown for the 3B model trained with 10% poisoned data, learning rate $1\times10^{-4}$.
  • Figure 4: A Trajectory of Mid-Processing Corruption for "What is the name of the horse like animal with black and white stripes?" (Zebra vs. Okapi). The trajectories of the healthy and corrupted models demonstrate initial agreement on the correct belief. The critical divergence occurs at Layer 9, where the corrupted model’s preference inverts. This indicates the reasoning pathway is compromised mid-process, while initial knowledge retrieval appears intact.
  • Figure 5: Late-Stage Belief Erosion for "Which Animal Runs the Fastest?" (Cheetah vs. Tiger). This trajectory reveals a distinct failure mode. Both models begin with a strong correct belief, but the corrupted model’s preference collapses in the final layers (26+) while the baseline recovers. This pattern suggests a failure of belief maintenance rather than initial fact retrieval.
  • ...and 1 more figure
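The quantities described in these captions can be sketched in a few lines. The example below assumes that a "wrong preference" means a negative log-likelihood difference for an item, and that the divergence layer is the first layer where the baseline model still prefers the true answer while the corrupted model prefers the counterfactual. Both definitions, the function names, and all numbers are illustrative assumptions, not values or procedures taken from the paper.

```python
from typing import Dict, Optional, Sequence


def wrong_preference_rate(delta_lls: Sequence[float]) -> float:
    """Fraction of items whose delta_ll is negative (counterfactual preferred)."""
    if not delta_lls:
        raise ValueError("need at least one item")
    return sum(1 for d in delta_lls if d < 0) / len(delta_lls)


def first_inversion_layer(
    baseline_gaps: Sequence[float], corrupted_gaps: Sequence[float]
) -> Optional[int]:
    """First layer where the baseline gap is positive but the corrupted gap is negative."""
    for layer, (b, c) in enumerate(zip(baseline_gaps, corrupted_gaps)):
        if b > 0 and c < 0:
            return layer
    return None


# Hypothetical per-item delta_ll values at two checkpoints.
checkpoints: Dict[str, Sequence[float]] = {
    "step_0": [3.2, 1.1, 0.4, 2.0],
    "step_100": [1.0, -0.5, 0.2, -1.2],
}
rates = {name: wrong_preference_rate(vals) for name, vals in checkpoints.items()}

# Hypothetical per-layer preference gaps for one question.
inversion = first_inversion_layer(
    baseline_gaps=[0.5, 0.8, 1.0, 1.2],
    corrupted_gaps=[0.4, 0.6, -0.3, -0.9],
)

print(rates)
print("first inversion layer:", inversion)
```

Plotting `rates` against the checkpoint step would give a curve of the kind Figures 1 to 3 describe, while `first_inversion_layer` corresponds to locating the divergence point discussed for Figures 4 and 5.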