TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

Maithili Kadam, Francis Ferraro

TL;DR

TAG-EQA introduces a prompting framework that injects structured causal event graphs into LLM inputs by verbalizing their edges, enabling event-based QA without model fine-tuning. The method systematically explores nine configurations (three prompting strategies crossed with three input modalities) on multiple instruction-tuned LLMs, using the TORQUESTRA dataset. Results show that causal graphs improve accuracy by about 5% on average, with larger gains in zero-shot and chain-of-thought settings, and that graph-augmented prompting benefits specific reasoning categories such as causal and temporal inference. The work demonstrates both the potential and the limitations of structured prompt inputs for enhancing event reasoning in LLMs and outlines directions for robustness, automation, and broader applicability.

Abstract

Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains of up to 12% in zero-shot settings and up to 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
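
To make the two ideas in the abstract concrete, here is a minimal Python sketch of (a) turning causal edges into natural-language statements and (b) enumerating the nine strategy-by-modality configurations. It is an illustration under stated assumptions, not the authors' implementation: the `Edge` class, the `verbalize_edge` and `verbalize_graph` helpers, and the template sentences are all hypothetical. Only the relation labels Enable and Block and the strategy and modality names are taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Edge:
    """A directed causal edge between two event descriptions (hypothetical type)."""
    source: str
    relation: str  # "Enable" or "Block", the relation labels named in the paper
    target: str


# Assumed template wording; the paper's exact phrasing is not given here.
TEMPLATES = {
    "Enable": "The event '{source}' enables the event '{target}'.",
    "Block": "The event '{source}' blocks the event '{target}'.",
}


def verbalize_edge(edge: Edge) -> str:
    """Render one edge as a single natural-language statement."""
    return TEMPLATES[edge.relation].format(source=edge.source, target=edge.target)


def verbalize_graph(edges: list[Edge]) -> str:
    """Render a whole graph as one statement per line, in input order."""
    return "\n".join(verbalize_edge(e) for e in edges)


# The 3 x 3 grid of configurations described in the abstract.
STRATEGIES = ("zero-shot", "few-shot", "chain-of-thought")
MODALITIES = ("text-only", "graph-only", "text+graph")
CONFIGURATIONS = list(product(STRATEGIES, MODALITIES))

if __name__ == "__main__":
    # Toy edges invented for illustration; not drawn from TORQUESTRA.
    toy_graph = [
        Edge("the storm hit", "Enable", "the power went out"),
        Edge("the power went out", "Block", "the concert started"),
    ]
    print(verbalize_graph(toy_graph))
    print(len(CONFIGURATIONS), "configurations")
```

Keeping the templates in a dictionary keyed by relation label means a new relation type only needs one new entry, which suits an approach that changes the prompt rather than the model.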

Paper Structure

This paper contains 35 sections, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Illustrative example from the TORQUESTRA dataset. Top: Narrative passage and a binary event-based question. Left: Annotated causal graph showing Enable and Block relations between events. Right: A step-by-step reasoning trace that follows the graph to support causal inference. Together, the graph and reasoning highlight how structured event relations enable models to answer questions that require indirect causal chaining.
  • Figure 2: Overview of our QA prompting pipeline for the TAG + CoT configuration. From left to right: a narrative passage and its associated causal graph are processed into a structured input. The causal graph is serialized into natural-language edges (yellow), and the original passage text is retained (blue). Prompt assembly combines task instructions, the text, the graph, in-context QA examples, and the main question into a single input to the instruction-tuned LLM (T5, QwQ, or GPT). The model produces both a yes/no answer and a step-by-step reasoning trace grounded in the causal structure (green).
  • Figure 3: Cluster-wise accuracy under the TAG configuration. Bars denote Zero-Text (blue), Few-Text (red), and CoT with TAG (yellow) across thirteen question types. Subfigures (a-c) report T5, QwQ, and GPT respectively. Text-only and graph-only cluster results appear in Appendix Figures 4 and 5.
  • Figure 4: Cluster-wise Accuracy by Model and Prompting Strategy. Accuracy across thirteen question categories for each model (T5, QwQ, GPT) under three prompting strategies: Zero-Text (blue), Few-Text (red), and CoT with TAG input (yellow). QwQ and GPT benefit most from graph-augmented CoT prompting on structured categories such as causal, past, and temporal_conflict. T5 performs best with few-shot prompting but struggles to integrate structure and reasoning. All models show weaker performance on underspecified or speculative categories like possible and unknown.
  • Figure 5: Cluster-wise Accuracy by Model and Prompting Strategy. Accuracy across thirteen question categories for each model (T5, QwQ, GPT) under three prompting strategies: Zero-Text (blue), Few-Text (red), and CoT with TAG input (yellow). QwQ and GPT benefit most from graph-augmented CoT prompting on structured categories such as causal, past, and temporal_conflict. T5 performs best with few-shot prompting but struggles to integrate structure and reasoning. All models show weaker performance on underspecified or speculative categories like possible and unknown.
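
The prompt assembly step described in the Figure 2 caption (task instructions, passage text, verbalized graph, in-context QA examples, then the main question) can be sketched in Python as below. This is a hedged illustration only: the function name `build_prompt`, its parameters, and the section labels ("Passage:", "Causal relations:", "Reasoning:") are assumptions, and the paper's actual prompt wording may differ. Leaving out the passage or the graph statements gives the graph-only or text-only modality, and leaving out the examples gives a zero-shot prompt.

```python
from typing import Optional, Sequence, Tuple

# One in-context example: (question, reasoning trace, yes/no answer).
Example = Tuple[str, str, str]


def build_prompt(
    instructions: str,
    question: str,
    passage: Optional[str] = None,
    graph_statements: Optional[Sequence[str]] = None,
    examples: Sequence[Example] = (),
) -> str:
    """Assemble one prompt string in the order given in the Figure 2 caption.

    Section labels are hypothetical; they are not taken from the paper.
    """
    parts = [instructions]
    if passage is not None:  # omitted in the graph-only modality
        parts.append("Passage:\n" + passage)
    if graph_statements:  # omitted in the text-only modality
        parts.append("Causal relations:\n" + "\n".join(graph_statements))
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(
            f"Question: {ex_question}\nReasoning: {ex_reasoning}\nAnswer: {ex_answer}"
        )
    # End on an open reasoning slot so the model writes its trace before answering.
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Toy inputs invented for illustration; not drawn from TORQUESTRA.
    prompt = build_prompt(
        instructions="Answer yes or no and explain your reasoning step by step.",
        passage="The storm hit the town. The power went out before the concert.",
        graph_statements=["The event 'the storm hit' enables the event 'the power went out'."],
        examples=[("Did the storm happen?", "The passage states the storm hit.", "yes")],
        question="Did the storm contribute to the power going out?",
    )
    print(prompt)
```

Because each block is appended only when its input is present, the same function covers all three input modalities without separate templates.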