EpiK-Eval: Evaluation for Language Models as Epistemic Models

Gabriele Prato; Jerry Huang; Prasanna Parthasarathi; Shagun Sodhani; Sarath Chandar

TL;DR

The paper studies how large language models consolidate knowledge that is spread across multiple training documents, treating the models as epistemic models. It introduces EpiK-Eval, a benchmark built from 18 narrative tasks that contrasts unsegmented and segmented training data to isolate cross-document consolidation: a model $M_U$ is fine-tuned on the unsegmented dataset $D_U$, and a model $M_S$ on the segmented dataset $D_S$. Empirical results show substantial weaknesses in knowledge consolidation: models trained on segmented data behave like Type I systems across model sizes, and scaling improves recall and reduces hallucinations only conditionally. The work highlights the limitations of current pretraining objectives for cross-document dependencies and outlines directions for redesigning training objectives, longer-context strategies, and scalable evaluation, with the aim of more robust, knowledge-consistent LLMs.
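The contrast between $D_U$ and $D_S$ can be pictured with a minimal sketch. It assumes, for illustration only, that a story is a titled list of sentences and that segmentation places each sentence in its own training sample; the function name `build_training_sets`, the sample format, and the example story are hypothetical and are not taken from the benchmark's code.

```python
from typing import Dict, List, Tuple


def build_training_sets(stories: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Return (d_u, d_s): unsegmented and segmented samples (illustrative only)."""
    d_u: List[str] = []
    d_s: List[str] = []
    for title, sentences in stories.items():
        # Unsegmented: the whole story forms a single training document.
        d_u.append(title + "\n" + "\n".join(sentences))
        # Segmented (assumed scheme): every sentence is its own document,
        # prefixed with the title so segments of one story share a key.
        for sentence in sentences:
            d_s.append(title + "\n" + sentence)
    return d_u, d_s


# Hypothetical two-sentence story, used only to show the shapes involved.
stories = {
    "Story A": ["On Monday, Ann went fishing.", "On Tuesday, Ann baked bread."],
}
d_u, d_s = build_training_sets(stories)
print(len(d_u), len(d_s))
```

Under this assumed scheme both sets carry the same facts, so any gap between a model trained on `d_u` and one trained on `d_s` would reflect how well information is consolidated across documents rather than what information was seen.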

Abstract

In the age of artificial intelligence, the role of large language models (LLMs) is becoming increasingly central. Despite their growing prevalence, their capacity to consolidate knowledge from different training documents - a crucial ability in numerous applications - remains unexplored. This paper presents the first study examining the capability of LLMs to effectively combine such information within their parameter space. We introduce EpiK-Eval, a novel question-answering benchmark tailored to evaluate LLMs' proficiency in formulating a coherent and consistent knowledge representation from segmented narratives. Evaluations across various LLMs reveal significant weaknesses in this domain. We contend that these shortcomings stem from the intrinsic nature of prevailing training objectives. Consequently, we advocate for refining the approach towards knowledge consolidation, as it harbors the potential to dramatically improve their overall effectiveness and performance. The findings from this study offer insights for developing more robust and reliable LLMs. Our code and benchmark are available at https://github.com/chandar-lab/EpiK-Eval

Paper Structure (28 sections, 23 figures, 22 tables)

Introduction
Epistemology & Language Models
EpiK-Eval
Dataset:
Evaluation Process:
Experiments
Are LMs Type I or Type II Systems?
In-Depth Answer Analysis
Recall:
Reasoning:
Final Answers:
Hallucinations
Effect of Scale
Related Work
Knowledge Representation:
...and 13 more sections

Figures (23)

Figure 1: When training on samples (red), Type I systems process each sequence independently, unable to discern their interrelations. Presented with a question (gray), they are unable to consolidate their knowledge and instead assign a probability to each fact when answering (green). In contrast, Type II systems can learn these relationships and possess a unified knowledge state, allowing them to answer accurately.
Figure 2: Performance on EpiK-Eval, measuring accuracy as the percentage of correct answers. Models struggle to answer questions that require consolidating knowledge from multiple training documents (orange). In comparison, they perform much better if the same information can be found within a single document (blue).
Figure 3: Breakdown of model answers into three parts: story recall, reasoning and final answer. (Left) percentage of correct recalls. (Center) percentage of correct reasonings when recall is correct. (Right) percentage of correct final answers when recall and reasoning are correct or when recall is correct and task has no reasoning part. Recall performance is worse when models need to recollect information from multiple training documents (orange) versus from single documents (blue), but reasoning and final answer capabilities seem unaffected.
Figure 4: Model hallucination rate on the training set (left) and the test set (right). Models which need to recall information from multiple documents seen during training (orange) are more prone to hallucinations during testing than models which only need to recall information from a single training document (blue).
Figure 5: Task 1 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).
...and 18 more figures
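
The conditional breakdown described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the `GradedAnswer` record, its field names, and the `breakdown` function are assumptions, and the example grades are made up purely to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GradedAnswer:
    recall_ok: bool
    reasoning_ok: Optional[bool]  # None when the task has no reasoning part
    final_ok: bool


def _rate(flags: List[bool]) -> Optional[float]:
    # Percentage of True flags, or None when the condition selects nothing.
    return 100.0 * sum(flags) / len(flags) if flags else None


def breakdown(answers: List[GradedAnswer]) -> Dict[str, Optional[float]]:
    # Left panel: correct recalls over all answers.
    recall = _rate([a.recall_ok for a in answers])
    # Center panel: correct reasonings, counted only when recall is correct
    # and the task actually has a reasoning part.
    reasoning = _rate([
        bool(a.reasoning_ok) for a in answers
        if a.recall_ok and a.reasoning_ok is not None
    ])
    # Right panel: correct final answers when recall is correct and the
    # reasoning is either correct or absent.
    final = _rate([
        a.final_ok for a in answers
        if a.recall_ok and a.reasoning_ok is not False
    ])
    return {"recall": recall, "reasoning": reasoning, "final": final}


# Made-up grades for four answers.
answers = [
    GradedAnswer(True, True, True),
    GradedAnswer(True, False, False),
    GradedAnswer(True, None, True),
    GradedAnswer(False, None, False),
]
result = breakdown(answers)
print(result)
```

Conditioning each stage on the success of the previous one keeps a recall failure from being counted again as a reasoning or final-answer failure, which is what lets the caption separate weaker recall from otherwise unaffected reasoning.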
