MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge

Jie He; Nan Hu; Wanqiu Long; Jiaoyan Chen; Jeff Z. Pan

TL;DR

MINTQA introduces two large-scale benchmarks, MINTQA-pop (long-tail knowledge) and MINTQA-ti (new knowledge), to evaluate LLMs on multi-hop question answering that involves new and tail knowledge. By mining Wikidata facts and generating questions with GPT-4o, the authors create 17,887 MINTQA-pop and 10,479 MINTQA-ti samples, each with sub-questions for diagnosing intermediate reasoning and retrieval behavior. An evaluation of 22 state-of-the-art LLMs shows that models struggle with both new knowledge and multi-hop reasoning: accuracy drops as hop count increases, and retrieval effectiveness varies by knowledge type. The work also probes two integration strategies, decomposition with retrieval and confidence-guided dynamic retrieval, and reports upper bounds obtained with gold components, which indicate substantial room for improving knowledge-grounded multi-hop QA. Overall, MINTQA offers detailed diagnostics and a roadmap for advancing multi-hop QA in realistic settings where knowledge changes over time.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs' capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly in handling new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at https://github.com/probe2/multi-hop/.
