Temporally Consistent Factuality Probing for Large Language Models

Ashutosh Bajpai; Aaryan Goyal; Atif Anwer; Tanmoy Chakraborty

TL;DR

This work introduces TeCFaP, a temporally consistent factuality probe for LLMs, together with the TEMP-COFAC dataset, which encodes temporal subject-relation-object sequences spanning the years 1526 to 2022. It extends factuality and consistency metrics to the temporal dimension and presents CoTSeLF, a framework that combines multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality. Experimental results show that off-the-shelf LLMs perform poorly on TeCFaP, while CoTSeLF yields substantial improvements over strong baselines, including discrete and smooth CTSRL variants. The work advances time-aware knowledge extraction for LLMs and has practical implications for domains requiring reliable temporal reasoning, such as healthcare and law.

Abstract

The prolific use of Large Language Models (LLMs) as an alternate knowledge base requires them to be factually consistent, that is, to answer paraphrased queries both correctly and consistently. Recently, significant attempts have been made to develop benchmark datasets and metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task to expand the consistent factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across the temporal dimension. We experiment with a diverse set of LLMs and find that most of them perform poorly on TeCFaP. Next, we propose a novel solution, CoTSeLF (Consistent-Time-Sensitive Learning Framework), which combines multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.

Paper Structure (25 sections, 4 equations, 13 figures, 11 tables)

Introduction
The TEMP-COFAC Dataset
TeCFaP Task Structure
Consistent-Time-Sensitive Learning Framework (CoTSeLF)
Experimental Results
Error Analysis
Related Work
Conclusion
Limitations
Ethics Statement
Appendix
Extended Description for TEMP-COFAC
Extended Results
ICL Setting Cont.
Closed Vocabulary Setting Cont.
...and 10 more sections

Figures (13)

Figure 1: Symbolic representation of the TeCFaP objective. An entity key_object holds a temporal relationship with another entity value_object via a subject-relation pair in either direction -- forward or backward.
Figure 2: The architectural framework of TEMP-COFAC -- (1) a set of diverse subject-relation pairs, (2) a sequence of entities which are temporally connected via a given subject-relation pair, (3) a set of paraphrase templates with a placeholder for key_object and value_object developed from subject-relation pairs, and (4) a closed vocabulary candidate set developed from possible entity space for a given subject-relation pair.
Figure 3: An instruction-based sample from training data for MT-IT model. Task k1: Generative sentence completion; Task k2: Binary paraphrase prediction.
Figure 4: Results for temporally consistent factuality (Temp-cons-fact) in $k$-shot ($k = 1, 2, 3$) ICL setup with LLaMA[13B] in an open vocabulary setting.
Figure 5: Average temporally consistent factuality and temporal factuality (second y-axis) in an open vocabulary and two-shot setting across temporal bins of entities (bin size: 10 years) with LLaMA[13B].
...and 8 more figures
