Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning
Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka
TL;DR
This work investigates how continual pre-training (CPT) can quietly shift well-established factual beliefs in large language models, a question inspired by the illusory truth effect. The authors introduce Layer of Truth, a CPT framework that injects controlled amounts of poisoned data across model scales, paired with a corpus of ground-truth facts and credible counterfactuals and an evaluation that spans multiple prompt formats. Belief shifts are quantified via the log-likelihood difference $\Delta LL$ and traced through layers with a Logit Lens, revealing patterns of mid-processing corruption and late-stage belief erosion that persist across checkpoints. The study highlights an underappreciated vulnerability in continually updated LLMs and underscores the need for monitoring, mitigation strategies, and further research into resilience against representational drift. These findings bear on maintaining factual integrity in dynamic knowledge systems and on designing safer update procedures for large language models.
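To make the $\Delta LL$ metric concrete, here is a minimal sketch in Python. The summary does not spell out the sign convention, so this assumes $\Delta LL = LL(\text{true answer}) - LL(\text{counterfactual answer})$, with each log-likelihood taken as the sum of per-token log-probabilities of the answer given the prompt. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def sequence_log_likelihood(token_log_probs):
    """Sum the per-token log-probabilities of an answer continuation."""
    return sum(token_log_probs)


def delta_ll(true_token_log_probs, false_token_log_probs):
    """Assumed convention: LL(true answer) - LL(counterfactual answer).

    Positive values mean the model prefers the true answer.
    """
    return (sequence_log_likelihood(true_token_log_probs)
            - sequence_log_likelihood(false_token_log_probs))


# Hypothetical per-token probabilities for a two-token answer,
# scored by a clean checkpoint and by a poisoned checkpoint.
before = delta_ll([math.log(0.8), math.log(0.9)],
                  [math.log(0.1), math.log(0.2)])
after = delta_ll([math.log(0.3), math.log(0.5)],
                 [math.log(0.4), math.log(0.6)])

# A negative shift means the preference moved toward the counterfactual.
belief_shift = after - before
print(f"before={before:.3f} after={after:.3f} shift={belief_shift:.3f}")
```

In practice the per-token log-probabilities would come from scoring both answers with the same prompt at each checkpoint; under this assumed convention, a sign flip from positive to negative marks the point where the counterfactual becomes the more likely answer.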
Abstract
Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition, where repeated exposure to falsehoods increases belief in their accuracy, we ask whether LLMs exhibit a similar vulnerability. Specifically, we investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth. We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: their capacity to internalize misinformation in a manner analogous to humans. This underscores the need for robust monitoring of factual integrity during model updates.
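Probing intermediate representations layer by layer can be illustrated with a toy logit-lens computation. The sketch below assumes the simplest form of the technique: each layer's hidden state is projected through the output (unembedding) matrix, and the log-probability gap between the true and the counterfactual answer token is read off. The final layer normalization that real models apply before unembedding is omitted, and the matrix, hidden states, and token ids are all made-up values, not the paper's setup.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logit_lens_gap(hidden_states, unembedding, true_id, false_id):
    """Per layer, return log p(true token) - log p(false token).

    hidden_states: one vector per layer (list of lists of floats).
    unembedding:   one row per vocabulary item, same width as a hidden vector.
    """
    gaps = []
    for h in hidden_states:
        # Project the hidden state onto the vocabulary.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        log_probs = log_softmax(logits)
        gaps.append(log_probs[true_id] - log_probs[false_id])
    return gaps


# Toy setup: 3-token vocabulary, 2-dimensional hidden states, 3 layers.
unembedding = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden_states = [[0.1, 0.1], [1.0, 0.2], [0.3, 1.5]]

gaps = logit_lens_gap(hidden_states, unembedding, true_id=0, false_id=1)
for layer, gap in enumerate(gaps):
    print(f"layer {layer}: log p(true) - log p(false) = {gap:.3f}")
```

Plotting such per-layer gaps for each checkpoint is one way to see at which depth a preference for the true answer weakens or reverses.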
