Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark

Jian Wu, Linyi Yang, Zhen Wang, Manabu Okumura, Yue Zhang

TL;DR

CofCA introduces a step-wise counterfactual MHQA benchmark to rigorously assess LLM reasoning over multiple evidence sources while decoupling internal memory from contextual reasoning. The framework combines counterfactual passage rewriting with sub-question generation and a suite of evaluation metrics (sub-QA, reasoning-chain, and joint performance) to reveal true reasoning capabilities and data contamination effects. Experiments across multiple LLMs show a substantial gap between factual and counterfactual data, with inflated performance driven by incorrect reasoning chains, and demonstrate that incorporating sub-questions improves step-wise reasoning. The work provides a reproducible protocol and highlights the need for reliable knowledge editing and robust, memory-agnostic evaluation for trustworthy LLMs.

Abstract

While Large Language Models (LLMs) excel in question-answering (QA) tasks, their real reasoning abilities in retrieving and integrating multiple pieces of evidence on Multi-hop QA (MHQA) tasks remain less explored. First, LLMs sometimes generate answers from internal memory rather than by retrieving evidence and reasoning over the given context, which raises concerns about how well existing evaluations measure real reasoning ability. Although previous counterfactual QA benchmarks can separate out the internal memory of LLMs, they focus solely on final QA performance. This is insufficient for reporting LLMs' real reasoning abilities, because LLMs are expected to engage in intricate reasoning processes that involve retrieving evidence and answering a series of sub-questions from the given passages. Moreover, current factual MHQA benchmarks are annotated on open-source corpora such as Wikipedia; although useful for evaluating multi-step reasoning, they are limited by potential data contamination in the LLMs' pre-training stage. To address these issues, we introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual and counterfactual data that reveals LLMs' real abilities in multi-step reasoning and supports reasoning-chain evaluation. Our experimental results reveal a significant performance gap for several LLMs between Wikipedia-based factual data and counterfactual data, indicating data contamination issues in existing benchmarks. Moreover, we observe that LLMs usually bypass the correct reasoning chain, which inflates their measured multi-step reasoning performance. We believe that our CofCA benchmark will enhance and facilitate the evaluation of trustworthy LLMs.
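
The observation that a model can reach the right final answer while bypassing the correct reasoning chain can be illustrated with a small scoring sketch. This is only a minimal illustration under stated assumptions: it uses exact match after simple normalization, and the record layout (`final`, `subs`), the function names, and the toy data are all hypothetical. The benchmark's own metric definitions are not given in this section and may differ.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and gold answer agree after normalization."""
    return normalize(pred) == normalize(gold)


def stepwise_report(examples):
    """Score final answers, sub-question chains, and both jointly.

    Each example is a dict (hypothetical layout):
      {"final": (pred, gold), "subs": [(pred, gold), ...]}
    """
    n = len(examples)
    if n == 0:
        return {"final_em": 0.0, "chain_em": 0.0, "joint_em": 0.0}
    final_ok = chain_ok = joint_ok = 0
    for ex in examples:
        final_hit = exact_match(*ex["final"])
        # The chain counts as correct only if every sub-question is right.
        chain_hit = all(exact_match(p, g) for p, g in ex["subs"])
        final_ok += final_hit
        chain_ok += chain_hit
        joint_ok += final_hit and chain_hit
    return {
        "final_em": final_ok / n,
        "chain_em": chain_ok / n,
        "joint_em": joint_ok / n,
    }


# Toy, invented records: the second one has a correct final answer
# but a wrong intermediate step.
toy = [
    {"final": ("Lunaris", "Lunaris"),
     "subs": [("Mira Vale", "Mira Vale"), ("Lunaris", "Lunaris")]},
    {"final": ("Lunaris", "Lunaris"),
     "subs": [("Orin Peak", "Mira Vale"), ("Lunaris", "Lunaris")]},
]
report = stepwise_report(toy)
print(report)
```

In this toy case the final-answer score is higher than the joint score, which is the kind of inflation that a final-answer-only evaluation would hide.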

Paper Structure (33 sections, 3 equations, 5 figures, 13 tables)

This paper contains 33 sections, 3 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Differences between (a) and (b). The words in red are the pieces of evidence for the given questions. Differences between factual MHQA (c) and counterfactual MHQA (d). (c): We input a factual MHQA question into ChatGPT with and without context; ChatGPT can output the correct answer from its internal memory regardless of the context. (d): When given a counterfactual question, whose passage is rewritten from the factual passage in (c) (words in red), ChatGPT cannot rely on its memory and must reason over the given context, showing that counterfactual QA can decouple LLMs' internal memory from their reasoning abilities.
  • Figure 2: The framework of our LLM-based automatic data annotation pipeline. From left to right, A: We first ask an LLM to act as a passage annotator that replaces keywords and paraphrases the passage. We then manually verify that the grammar is correct and that the key information has been changed. We send the reviewed high-quality data to GPT-4 to generate QA pairs and manually check their quality. B: After receiving the reviewed high-quality counterfactual QA data, we evaluate LLMs on the generated data to test their inherent ability on MHQA.
  • Figure 3: The change in F$_1$ and EM scores when answering 2 sub-questions on the 2-hop dataset.
  • Figure 4: The change in EM and F$_1$ scores when moving from 2-hop questions to 4-hop questions.
  • Figure 5: Performance gap between Wikipedia-based factual multi-hop QA datasets and our 2-hop, 3-hop, and 4-hop counterfactual MHQA data, based on Table \ref{Wiki_exp} and Table \ref{CofCA_exp}. The line charts show a clear performance gap for LLMs between previous datasets and CofCA.
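
The figures above report EM and F$_1$ scores. A common way to compute these for short-answer QA is the SQuAD-style recipe: normalize both strings, then compare them exactly (EM) or by token overlap (F$_1$). The sketch below assumes that recipe; the paper's actual scoring script is not shown in this section, so the normalization rules and function names here are assumptions.

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def em_score(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(pred) == normalize_answer(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(pred).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, with the prediction "big red house" and the gold answer "red house", EM is 0 while F$_1$ is 0.8 (precision 2/3, recall 1), which is why the two metrics can move differently across hop counts.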