Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

Jiri Hron; Laura Culp; Gamaleldin Elsayed; Rosanne Liu; Ben Adlam; Maxwell Bileschi; Bernd Bohnet; JD Co-Reyes; Noah Fiedel; C. Daniel Freeman; Izzeddin Gur; Kathleen Kenealy; Jaehoon Lee; Peter J. Liu; Gaurav Mishra; Igor Mordatch; Azade Nova; Roman Novak; Aaron Parisi; Jeffrey Pennington; Alex Rizkowsky; Isabelle Simpson; Hanie Sedghi; Jascha Sohl-dickstein; Kevin Swersky; Sharad Vikram; Tris Warkentin; Lechao Xiao; Kelvin Xu; Jasper Snoek; Simon Kornblith


TL;DR

This paper studies hallucinations in language models (LMs) trained on a knowledge graph (KG), which gives precise control over the training content. For a fixed KG, larger and longer-trained models hallucinate less, but reaching a very low hallucination rate on the training set requires substantially more compute and longer training, at some cost to generalization on unseen data. The paper also shows that hallucinations become harder to detect as LM scale increases. Comparing detector architectures and task formulations, it finds that larger detectors improve detection for a fixed LM, yet detectability declines as the LM grows, suggesting limits to post-hoc mitigation at scale. The findings motivate exploring retrieval-based and uncertainty-based approaches and provide guidance for evaluating and debiasing LMs in settings with tightly controlled factual content.

Abstract

While many capabilities of language models (LMs) improve with increased training budget, the influence of scale on hallucinations is not yet fully understood. Hallucinations come in many forms, and there is no universally accepted definition. We thus focus on studying only those hallucinations where a correct answer appears verbatim in the training set. To fully control the training data content, we construct a knowledge graph (KG)-based dataset, and use it to train a set of increasingly large LMs. We find that for a fixed dataset, larger and longer-trained LMs hallucinate less. However, hallucinating on $\leq 5\%$ of the training data requires an order of magnitude larger model, and thus an order of magnitude more compute, than Hoffmann et al. (2022) reported was optimal. Given this costliness, we study how hallucination detectors depend on scale. While we see that detector size improves performance on a fixed LM's outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.
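As an illustration of the setup the abstract describes, the following is a minimal sketch of how KG triplets could be serialized and how a hallucination rate could be measured against the training set. The special tokens are the ones named in the Figure 1 caption; the exact string layout, the function names, the toy triplets, and the rule "a prediction counts as a hallucination if its object never appears with that subject and predicate in the training data" are assumptions made for this example, not the authors' implementation.

```python
from collections import defaultdict

# Special tokens as named in the Figure 1 caption.
S_TKN, P_TKN, O_TKN = "<S_TKN>", "<P_TKN>", "<O_TKN>"


def format_triplet(subject, predicate, obj):
    """Training string: every field is prefixed by its special token.
    The whitespace layout is an assumption of this sketch."""
    return f"{S_TKN} {subject} {P_TKN} {predicate} {O_TKN} {obj}"


def format_prompt(subject, predicate):
    """Evaluation prompt: subject and predicate are given, the LM continues with the object."""
    return f"{S_TKN} {subject} {P_TKN} {predicate} {O_TKN}"


def build_index(triplets):
    """Map each (subject, predicate) pair to the set of objects seen in training."""
    index = defaultdict(set)
    for subject, predicate, obj in triplets:
        index[(subject, predicate)].add(obj)
    return index


def hallucination_rate(predictions, index):
    """Fraction of predictions whose object is not a valid completion.

    predictions: list of ((subject, predicate), predicted_object) pairs.
    """
    if not predictions:
        return 0.0
    misses = sum(1 for key, obj in predictions if obj not in index.get(key, set()))
    return misses / len(predictions)


# Toy data, invented purely for illustration.
triplets = [
    ("Paris", "capital_of", "France"),
    ("France", "continent", "Europe"),
    ("France", "official_language", "French"),
]
index = build_index(triplets)

# Hypothetical model outputs: two valid objects, two invalid ones.
predictions = [
    (("Paris", "capital_of"), "France"),
    (("France", "continent"), "Asia"),
    (("France", "official_language"), "French"),
    (("Paris", "capital_of"), "Spain"),
]
rate = hallucination_rate(predictions, index)
print(format_prompt("Paris", "capital_of"))
print(rate)
```

Because every valid object is known from the KG, the check is a plain set lookup; this is the kind of control over training content that free-form text corpora do not offer.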


Paper Structure (13 sections, 11 figures, 4 tables)


Introduction
Controlling What an LM Knows
The Knowledge Graph dataset
Training LMs on the Knowledge Graph
Hallucination Rate and How It Scales
Hallucination Detectability and How It Scales
Setup
Results
Limitations
Conclusion
Related Work
Learning rates, number of training steps
Additional plots

Figures (11)

Figure 1: Data and the training pipeline. <S_TKN>, <P_TKN> and <O_TKN> are special tokens indicating subject, predicate, and object, respectively. (a) The original data exist in the form of a Knowledge Graph (KG), where nodes representing subjects and objects are connected by predicates (arrows). (b) The KG is then formatted into triplets (subject, predicate, object), each element prefixed with the special token indicating its identity. The formatted data are used to pretrain autoregressive LMs with the common next-token-prediction loss. (c) Pretrained LMs are evaluated by prompting with the subject and predicate, alongside the special tokens, and having the LM predict the object. (d) On top of the pretrained LMs, detectors are trained to detect the presence of hallucinations during generation.
Figure 2: Hallucination rate per LM training FLOPs on examples seen (top) and unseen (bottom) during training, for the small (left) and large (right) datasets. Each dot is an independent training run with the learning rate schedule adjusted to the training length (see the section Training LMs on the Knowledge Graph). Dots correspond to [1, 2, 10, 20, 100, 200] epochs for 1% Data, and [1, 2, 10, 20] epochs for 10% Data. For a fixed dataset, the more FLOPs, the lower the hallucination rate. In contrast to established scaling laws for loss on text (Kaplan et al., 2020; Hoffmann et al., 2022), performance actually worsens with dataset size (top left vs. top right), as the larger dataset requires learning more facts. Training for 20+ epochs is necessary to minimise hallucinations on seen data (top), but can lead to overfitting, i.e., a higher hallucination rate on unseen data (bottom), presenting a trade-off between fact recall and the ability to generalize. This is even more pronounced at $\text{temp}=0.0$ (see the corresponding temperature-0 figure, `fig:hall_per_flop_t0`). The hallucination rate upticks for the 113M and 404M LMs on 1% Data are not mirrored by the training loss (Figure 3), i.e., they are not due to loss divergence.
Figure 3: Training loss per LM training FLOPs. The y-axis shows the per-token average autoregressive cross-entropy loss over all tokens (i.e., subject, predicate, and object). Complementing Figure 2, larger models attain smaller loss at a fixed dataset size, but the loss increases when the training set size grows.
Figure 4: Precision and recall as a function of temperature on the 1% Data. Marker size represents the number of non-embedding parameters; marker type represents the number of epochs for which the LM was trained. For each temperature in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], we generate 16 object predictions for every subject-predicate pair, and evaluate precision and recall against the valid completions. The reported numbers are averages over all examples. Lower temperatures yield higher precision; higher temperatures yield higher recall.
Figure 5: Hallucination detection accuracy as a function of the LM size for various task formulations and detector types. Detectors were trained and evaluated on distinct splits of data obtained by having a given pretrained LM generate 5 completions for every subject-predicate in its training set (using $\text{temp} = 1.0$). The accuracy of all the trained hallucination detectors is generally high, especially for outputs of the larger LMs. Larger (full) detectors work better than smaller ones (head). The token-level detection task formulation seems to provide higher detection accuracy, although not in all cases. The results here are confounded by the varying hallucination rates of the underlying LM (e.g., if the LM hallucinates only 5% of the time, a detector which finds no hallucinations achieves 95% accuracy).
...and 6 more figures
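Two of the evaluation quantities mentioned in the captions above can be sketched in a few lines. The first function computes precision and recall for one subject-predicate pair from a batch of sampled objects (the Figure 4 caption uses 16 samples per pair). The second reproduces the accuracy confound noted in the Figure 5 caption, where a detector that never flags anything scores high accuracy when hallucinations are rare. The captions do not give formulas, so the definitions below (precision as the share of samples that are valid, recall as the share of valid objects that appear among the samples) and all names and toy values are assumptions of this sketch.

```python
def precision_recall(samples, valid):
    """Precision and recall of sampled objects for one subject-predicate pair.

    samples: list of generated objects (duplicates allowed).
    valid:   set of valid completions for that pair.
    """
    if not samples or not valid:
        raise ValueError("need at least one sample and one valid completion")
    # Precision: share of generated samples that are valid completions.
    precision = sum(1 for s in samples if s in valid) / len(samples)
    # Recall: share of valid completions recovered by at least one sample.
    recall = len(set(samples) & valid) / len(valid)
    return precision, recall


def always_negative_accuracy(hallucination_rate):
    """Accuracy of a trivial detector that never predicts a hallucination."""
    return 1.0 - hallucination_rate


# Toy example, invented for illustration: 4 samples, 3 valid objects.
precision, recall = precision_recall(["a", "a", "b", "x"], {"a", "b", "c"})
print(precision, recall)

# The case from the Figure 5 caption: the LM hallucinates 5% of the time.
print(always_negative_accuracy(0.05))
```

Under these definitions, sampling the same valid object repeatedly keeps precision high but leaves recall low, which is consistent with the temperature trend the Figure 4 caption reports. The trivial baseline also shows why raw accuracy is hard to compare across LMs with different hallucination rates.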
