Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
Jiri Hron, Laura Culp, Gamaleldin Elsayed, Rosanne Liu, Ben Adlam, Maxwell Bileschi, Bernd Bohnet, JD Co-Reyes, Noah Fiedel, C. Daniel Freeman, Izzeddin Gur, Kathleen Kenealy, Jaehoon Lee, Peter J. Liu, Gaurav Mishra, Igor Mordatch, Azade Nova, Roman Novak, Aaron Parisi, Jeffrey Pennington, Alex Rizkowsky, Isabelle Simpson, Hanie Sedghi, Jascha Sohl-Dickstein, Kevin Swersky, Sharad Vikram, Tris Warkentin, Lechao Xiao, Kelvin Xu, Jasper Snoek, Simon Kornblith
TL;DR
This paper studies hallucinations in language models (LMs) trained on a knowledge graph (KG), a setup that gives precise control over the training content. For a fixed KG, larger and longer-trained models hallucinate less, but driving training-set hallucinations to a very low rate requires substantially more compute and longer training, with a trade-off in generalization. The paper also shows that hallucinations become harder to detect as LM scale increases. Comparing detector architectures and task formulations, it finds that larger detectors improve detection on a fixed LM's outputs, yet detectability declines as the LM itself grows, suggesting limits to post-hoc mitigation. The findings motivate exploring retrieval-based and uncertainty-based approaches and provide guidance for evaluating and debiasing LMs in settings with tightly controlled factual content.
Abstract
While many capabilities of language models (LMs) improve with increased training budget, the influence of scale on hallucinations is not yet fully understood. Hallucinations come in many forms, and there is no universally accepted definition. We thus focus on studying only those hallucinations where a correct answer appears verbatim in the training set. To fully control the training data content, we construct a knowledge graph (KG)-based dataset, and use it to train a set of increasingly large LMs. We find that for a fixed dataset, larger and longer-trained LMs hallucinate less. However, hallucinating on $\leq 5\%$ of the training data requires an order of magnitude larger model, and thus an order of magnitude more compute, than Hoffmann et al. (2022) reported was optimal. Given this costliness, we study how hallucination detectors depend on scale. While we see that detector size improves performance on a fixed LM's outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.
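
To make the restricted notion of hallucination above concrete, the following is a minimal sketch of how a hallucination rate could be measured against a KG, assuming the KG is stored as (subject, predicate, object) triples and the LM is asked to complete the object. The toy triples, the function names `build_index` and `hallucination_rate`, and the exact-match scoring rule are illustrative assumptions, not the paper's evaluation code.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, predicate, object) triples.
# Hypothetical data, used only for illustration.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Marie Curie", "field", "chemistry"),
    ("Alan Turing", "born_in", "London"),
]


def build_index(triples):
    """Map each (subject, predicate) pair to the set of objects seen in training."""
    index = defaultdict(set)
    for subj, pred, obj in triples:
        index[(subj, pred)].add(obj)
    return index


def hallucination_rate(predictions, index):
    """Fraction of scored predictions whose object matches no training triple.

    Assumption: only queries whose (subject, predicate) pair appears in the
    training set are scored, so a correct answer always exists verbatim.
    """
    scored = [(s, p, o) for s, p, o in predictions if (s, p) in index]
    if not scored:
        return 0.0
    wrong = sum(1 for s, p, o in scored if o not in index[(s, p)])
    return wrong / len(scored)


index = build_index(TRIPLES)
predictions = [
    ("Marie Curie", "born_in", "Warsaw"),   # matches a training triple
    ("Marie Curie", "field", "chemistry"),  # any valid object counts as correct
    ("Alan Turing", "born_in", "Paris"),    # no matching triple: hallucination
    ("Ada Lovelace", "born_in", "London"),  # pair unseen in training: not scored
]
rate = hallucination_rate(predictions, index)
print(f"hallucination rate: {rate:.3f}")
```

Scoring against the set of all valid objects for a (subject, predicate) pair, rather than a single reference answer, avoids penalizing a model for producing a different but equally correct completion.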
