How Much Reasoning Do Retrieval-Augmented Models Add beyond LLMs? A Benchmarking Framework for Multi-Hop Inference over Hybrid Knowledge

Junhong Lin; Bing Zhang; Song Wang; Ziyan Liu; Dan Gutfreund; Julian Shun; Yada Zhu

TL;DR

HybridRAG-Bench is a contamination-aware benchmarking framework for evaluating retrieval-intensive, multi-hop reasoning over hybrid knowledge (unstructured text and knowledge graphs). It constructs time-framed corpora and domain-specific knowledge graphs from recent arXiv literature, generates QA pairs grounded in explicit reasoning paths, and automatically validates QA quality so that the evaluation depends on retrieval. Experiments across the AI, governance, and bioinformatics domains show that retrieval and graph-based reasoning yield substantial gains over LLM-only prompting and that hybrid KG-RAG methods outperform text-based retrieval, with detailed diagnostics by question type. The framework provides scalable, reproducible infrastructure for evaluating knowledge-augmented reasoning systems in evolving knowledge domains, with implications for fairer benchmarking and more robust RAG/KG-RAG systems.

Abstract

Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. Reliable evaluation of the retrieval and reasoning capabilities of such augmented systems therefore becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.
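The domain and time-frame selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Paper` record, the `select_time_framed` helper, the sample entries, and the date window are all hypothetical, chosen only to show how restricting a corpus to documents published after an assumed model knowledge cutoff supports contamination-aware evaluation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one arXiv paper (not the benchmark's schema)."""
    paper_id: str
    category: str
    published: date


def select_time_framed(papers, categories, start, end):
    """Keep papers in the chosen categories published within [start, end]."""
    return [
        p for p in papers
        if p.category in categories and start <= p.published <= end
    ]


# Illustrative data only; identifiers and dates are made up.
papers = [
    Paper("paper-a", "cs.AI", date(2023, 5, 1)),
    Paper("paper-b", "cs.AI", date(2025, 3, 10)),
    Paper("paper-c", "q-bio.GN", date(2025, 4, 2)),
]

# Assumption: the evaluated model's knowledge cutoff is the end of 2024,
# so the window starts after it.
selected = select_time_framed(
    papers,
    categories={"cs.AI"},
    start=date(2025, 1, 1),
    end=date(2025, 6, 30),
)
print([p.paper_id for p in selected])
```

Moving the window forward as newer models are released keeps the corpus outside their likely pretraining data, which is the property the framework relies on.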

Paper Structure (30 sections, 5 equations, 3 figures, 9 tables)

Introduction
Related Works
Problem Definition and Preliminaries
Domains, Corpora, and Knowledge Graphs
Questions and Evaluation Scope
Task Definition
Benchmark Construction
Time-Framed Corpus Collection
Knowledge Graph Construction
Entity Extraction and Alignment
Relation Normalization and Evidence Tracking
Hybrid-Grounded Question–Answer Generation
Reasoning Path Sampling
Hybrid Question Construction
QA Pairs Quality Control
...and 15 more sections

Figures (3)

Figure 1: Illustration of the HybridRAG-Bench benchmarking framework.
Figure 2: Token usage during KG construction plotted against corpus length. Token cost grows approximately linearly with input size due to proportional input tokens and a fixed number of extraction calls per document. The smooth scaling pattern confirms that EvoKG’s update cost remains predictable and stable across document sizes.
Figure 3: Per-document KG construction latency as a function of corpus length (in characters). Longer documents incur proportionally higher extraction time, and the trend follows the expected near-linear scaling dominated by LLM token processing.
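
The near-linear trend described in the captions of Figures 2 and 3 is consistent with a simple cost model. The following is a rough sketch under stated assumptions, not a formula from the paper: every symbol is hypothetical. Let $n$ be the document length in characters, $a$ the average number of tokens per character, $k$ the fixed number of extraction calls per document, $p$ the prompt overhead in tokens per call, and $o$ the total output tokens per document, assumed roughly independent of $n$.

```latex
% Assumed token-cost model for one document (illustrative only).
% Each of the k calls sends the prompt overhead p plus the document text (a n tokens).
T(n) \approx k\,(p + a\,n) + o
     = \underbrace{(k\,a)}_{\text{slope}}\, n \;+\; \underbrace{(k\,p + o)}_{\text{intercept}}
```

Under these assumptions the slope depends only on the tokenizer ratio and the number of calls, so cost per document grows linearly in $n$. If latency is dominated by token processing, as the Figure 3 caption states, it would follow the same shape.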