HOLMES: Hyper-Relational Knowledge Graphs for Multi-hop Question Answering using LLMs
Pranoy Panda, Ankush Agarwal, Chaitanya Devaguptapu, Manohar Kaul, Prathosh A P
TL;DR
The paper tackles multi-hop question answering (MHQA) over unstructured text by supplying LLMs with a query-aware hyper-relational knowledge graph (KG) distilled from the supporting documents, which reduces noise and token count. HOLMES is a training-free, zero-shot pipeline with three components: query-dependent knowledge discovery, a query-aligned knowledge schema for refinement, and reader prompt construction that verbalizes the distilled facts. It achieves state-of-the-art results on HotpotQA and MuSiQue across multiple LLMs while using fewer input tokens and maintaining high semantic and human-evaluated quality. The approach improves context grounding for MHQA and offers a scalable, education-friendly solution with potential for broader domains and further efficiency gains. Its limitations are the increased upfront computation for auxiliary schema creation and potential incompleteness in the extracted graphs.
Abstract
Given unstructured text, Large Language Models (LLMs) are adept at answering simple (single-hop) questions. However, as the complexity of the questions increases, the performance of LLMs degrades. We believe this is due to the overhead of understanding the complex question and then filtering and aggregating unstructured information in the raw text. Recent methods try to reduce this burden by integrating structured knowledge triples into the raw text, aiming to provide a structured overview that simplifies information processing. However, this simplistic approach is query-agnostic, and the extracted facts are ambiguous because they lack context. To address these drawbacks and to enable LLMs to answer complex (multi-hop) questions with ease, we propose to use a knowledge graph (KG) that is context-aware and is distilled to contain query-relevant information. Using our compressed, distilled KG as input to the LLM, our method requires up to $67\%$ fewer tokens to represent the query-relevant information present in the supporting documents, compared to the state-of-the-art (SoTA) method. Our experiments show consistent improvements over the SoTA across several metrics (EM, F1, BERTScore, and Human Eval) on two popular benchmark datasets (HotpotQA and MuSiQue).
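
To make the idea of a context-aware, query-distilled KG more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the HOLMES implementation: the `HyperFact` class, the hop-based `select_relevant` filter, the prompt layout, and the toy facts are all hypothetical. In particular, the simple entity-hop filter stands in for the paper's query-dependent discovery and schema-based refinement, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperFact:
    """A triple plus qualifiers that carry the context a bare triple lacks."""
    subject: str
    relation: str
    obj: str
    qualifiers: tuple = ()  # tuple of (key, value) pairs, e.g. (("year", "2001"),)


def verbalize(fact):
    """Render one fact as a short sentence for the reader prompt."""
    text = f"{fact.subject} {fact.relation} {fact.obj}"
    if fact.qualifiers:
        extras = "; ".join(f"{k}: {v}" for k, v in fact.qualifiers)
        text += f" ({extras})"
    return text + "."


def select_relevant(facts, seed_entities, hops=2):
    """Keep facts reachable from the query entities within `hops` steps.

    Assumption: a plain entity-overlap expansion, used only as a stand-in
    for query-aware distillation.
    """
    frontier = {e.lower() for e in seed_entities}
    kept = []
    for _ in range(hops):
        newly = [
            f for f in facts
            if f not in kept
            and (f.subject.lower() in frontier or f.obj.lower() in frontier)
        ]
        kept.extend(newly)
        for f in newly:
            frontier.update({f.subject.lower(), f.obj.lower()})
    return kept


def build_reader_prompt(question, facts):
    """Assemble a prompt that contains only the verbalized, distilled facts."""
    lines = ["Facts:"] + [f"- {verbalize(f)}" for f in facts]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Toy, made-up facts (not taken from HotpotQA or MuSiQue).
facts = [
    HyperFact("Alice", "directed", "Film X", (("year", "2001"),)),
    HyperFact("Film X", "stars", "Bob"),
    HyperFact("Carol", "wrote", "Book Y"),
]
question = "Who stars in the film directed by Alice?"
distilled = select_relevant(facts, ["Alice"], hops=2)
prompt = build_reader_prompt(question, distilled)
print(prompt)
```

In this toy run the unrelated fact about Carol is dropped, so the prompt carries only the two facts needed for the two-hop question, and the `year` qualifier travels with its triple rather than being lost.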
