Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning

Qingyu Tan; Hwee Tou Ng; Lidong Bing

TL;DR

This paper proposes Complex-TR, a temporal QA dataset focused on multi-answer and multi-hop temporal reasoning, together with a novel data augmentation strategy for improving the complex temporal reasoning capability and robustness of LLMs. Experimental results show that the method improves LLMs' performance on temporal QA benchmarks by significant margins.

Abstract

Knowledge in the real world is being updated constantly. However, it is costly to frequently update large language models (LLMs). Therefore, it is crucial for LLMs to understand the concept of temporal knowledge. However, prior work on temporal question answering (TQA) did not emphasize multi-answer and multi-hop types of temporal reasoning. In this paper, we propose Complex-TR, a complex temporal question-answering dataset that focuses on multi-answer and multi-hop temporal reasoning. We also propose a novel data augmentation strategy to improve the complex temporal reasoning capability and robustness of LLMs. We conducted experiments on multiple temporal QA datasets. Experimental results show that our method improves LLMs' performance on temporal QA benchmarks by significant margins. Our code and data are released at: https://github.com/nusnlp/complex-tr.

Paper Structure (26 sections, 5 equations, 3 figures, 14 tables)

This paper contains 26 sections, 5 equations, 3 figures, 14 tables.

Introduction
Our Dataset
Methodology
Pseudo-Instruction Tuning
Context Refinement
Experiments
Experimental Setup
Evaluation Metrics
Experimental Results
Analysis
Robustness of Temporal Reasoning
Analysis of Multi-Answer Questions
Related Work
Conclusions
Limitations
...and 11 more sections

Figures (3)

Figure 1: An example of a 3-hop temporal expression for $t_{3}$. The temporal expressions are highlighted in yellow in the paragraph. The temporal expressions include exact timestamps and time intervals. This example is taken from Elon Musk's Wikipedia page on 18 June 2023.
Figure 2: Examples of GPT-4's erroneous temporal reasoning in the ReasonQA setting.
Figure 3: Annotation interface for the human verification process. Annotators are only asked to give True or False labels to the QA pairs and their contexts.
