On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo

TL;DR

This work tackles temporal question answering (TQA) over data that LLMs have not seen during training, framing TQA as a system-level problem that couples LLMs with external, deterministic tools. It identifies 17 common temporal-question types, builds the RATA dataset from semi-structured anonymized data so that answers depend on reasoning rather than prior knowledge, and compares multiple reasoning paradigms, including CoT, ToT, reflexion, external execution, and API-based approaches. The study finds that purely internal LLM reasoning struggles on complex temporal tasks, whereas external execution via a predefined function API (CoTAPI) delivers the best accuracy (up to 93%) and CoTE offers strong zero-shot performance on unknown tasks; together, these results point to integrated, tool-assisted architectures for scalable TQA. The work also demonstrates a temporal-confidence mechanism that lets an LLM agent distinguish temporal questions from knowledge-based ones, which supports targeted deployment in temporal AI systems.

Abstract

The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data that is not present during training remains largely unexplored. In this paper we address this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the Reasoning and Answering Temporal Ability (RATA) dataset, featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflexion and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.

Paper Structure

This paper contains 32 sections, 5 figures, 7 tables, and 5 algorithms.

Figures (5)

  • Figure 1: Diagram with the different techniques proposed.
  • Figure 2: Execution time vs number of tokens.
  • Figure 3: Percentage of false predictions per technique as the size of the data (tokens) increases.
  • Figure 4: Accuracy grouped by algorithms.
  • Figure 5: Accuracy grouped by the response type.