TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables

Abhilash Shankarampeta, Harsh Mahajan, Tushar Kataria, Dan Roth, Vivek Gupta

TL;DR

TransientTables targets a core gap in LLM temporal reasoning by evaluating how models reason over temporally evolving, entity-centric infobox tables. The authors construct a large-scale dataset of 3,971 QA pairs drawn from 14,133 tables across 1,238 entities, and they introduce a template-based QA pipeline plus a multi-stage task decomposition to improve grounding and reasoning. Across extensive experiments with multiple models and prompting regimes, they show substantial room for improvement relative to humans, with decomposition, larger context and fine-tuning yielding notable gains. The work demonstrates the limits of current LLMs on temporal, multi-table reasoning and provides a principled framework and benchmarks to push forward temporal reasoning in NLP applications.

Abstract

Humans continuously make new discoveries, and understanding the temporal sequence of events leading to these breakthroughs is essential for advancing science and society. This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives. However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods. We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions. Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance.

Paper Structure

This paper contains 37 sections, 1 figure, 21 tables.

Figures (1)

  • Figure 1: Example of Transient Information in Tables. This example of the Indian Cricket Team presents three tables sampled at different time points: 2017, 2020, and 2023. It illustrates how certain values, such as Captain, ICC ranking, and Tests played, change over time. However, inconsistencies exist in the tables, including missing keys and incorrect values, such as the "test status acquired" field, as noted by Khincha et al. (2023). In this work, we focus only on transient (or temporally changing) information.
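
To make the idea of transient information concrete, the following is a minimal sketch of how keys whose values change across timestamped snapshots of one entity's infobox might be identified, and how a value could be looked up as of a given year. It is not the authors' pipeline: the function names `transient_keys` and `value_as_of`, the dictionary layout, and all table values are hypothetical placeholders chosen for illustration only.

```python
from typing import Dict, List, Optional

# Hypothetical snapshots of one entity's infobox, keyed by year.
# All values are illustrative placeholders, not taken from the dataset.
Snapshots = Dict[int, Dict[str, str]]

snapshots: Snapshots = {
    2017: {"Captain": "Player A", "ICC ranking": "1",
           "Tests played": "500", "Association": "Board X"},
    2020: {"Captain": "Player A", "ICC ranking": "2",
           "Tests played": "540", "Association": "Board X"},
    # The 2023 snapshot is missing the "Association" key on purpose,
    # mimicking the missing-key inconsistency mentioned in the caption.
    2023: {"Captain": "Player B", "ICC ranking": "1",
           "Tests played": "570"},
}


def transient_keys(tables: Snapshots) -> List[str]:
    """Return keys that take more than one distinct value across snapshots."""
    all_keys = set().union(*(t.keys() for t in tables.values()))
    changed = []
    for key in sorted(all_keys):
        # Only compare snapshots that actually contain the key.
        values = {t[key] for t in tables.values() if key in t}
        if len(values) > 1:
            changed.append(key)
    return changed


def value_as_of(tables: Snapshots, key: str, year: int) -> Optional[str]:
    """Return the value of `key` from the latest snapshot at or before `year`."""
    for snapshot_year in sorted(tables, reverse=True):
        if snapshot_year <= year and key in tables[snapshot_year]:
            return tables[snapshot_year][key]
    return None


if __name__ == "__main__":
    print(transient_keys(snapshots))
    print(value_as_of(snapshots, "Captain", 2021))
```

In this sketch a missing key is treated as "unknown" rather than as a change, so a key absent from one snapshot is not flagged as transient on that basis alone. That is a design assumption of the example, not a claim about how the dataset handles missing keys.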