Table of Contents
Fetching ...

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

Anish Pahilajani, Devasha Trivedi, Jincen Shuai, Khin S. Yone, Samyak Rajesh Jain, Namyong Park, Ryan A. Rossi, Nesreen K. Ahmed, Franck Dernoncourt, Yu Wang

TL;DR

This work introduces the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs and reveals that LLMs perform differently when handling questions with varying reasoning structures.

Abstract

Large Language Models (LLMs) have excelled in multi-hop question-answering (M-QA) due to their advanced reasoning abilities. However, the impact of the inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, where different reasoning structures are entangled together, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable a fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding facilitates the exploration of textual structures as compared with semantics.

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

TL;DR

This work introduces the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs and reveals that LLMs perform differently when handling questions with varying reasoning structures.

Abstract

Large Language Models (LLMs) have excelled in multi-hop question-answering (M-QA) due to their advanced reasoning abilities. However, the impact of the inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, where different reasoning structures are entangled together, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable a fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding facilitates the exploration of textual structures as compared with semantics.

Paper Structure

This paper contains 23 sections, 25 figures, 10 tables.

Figures (25)

  • Figure 1: Reasoning graphs constructed based on one QA instance from HotpotQA dataset yang2018hotpotqa that maps out the logical steps required to arrive at the answer. The left-hand side illustrates the positive reasoning graph, which is constructed from the supporting paragraphs provided in the original dataset. This graph represents the gold reasoning path needed to answer the question. On the right-hand side, two types of negative reasoning graphs are derived from the original positive reasoning graphs by either perturbing the edges (e.g., inversing the edge direction in this case) or adding additional nodes with irrelevant sentences.
  • Figure 2: Statistical Analysis of the Distribution of GRS-QA.
  • Figure 3: Comparison of BM25, TFIDF, and DPR Recall and Weighted Recall Across Question Types
  • Figure 4: LLM Judge Scores by Question Type for Different LLMs
  • Figure 5: LLM Judge Scores by Hop Type for Different LLMs
  • ...and 20 more figures